Abstract. This work is a survey of the average cost control problem for discrete-time Markov
processes. We have attempted to put together a comprehensive account of the considerable
[...] several important questions which are still left open to investigation.

§1. Introduction

The  average  cost  criterion  (equivalently,  the  long  run  average  or  ergodic  cost)  is  a  popular 
criterion  for  optimization  of  stochastic  dynamical  systems  over  an  infinite  time  horizon.  It 
is  a  reasonable  criterion  to  use  when  the  anticipated  time  interval  for  optimization  (which  in 
practice  is  finite)  is  long  compared  to  other  time  scales  involved  and  there  are  no  compelling 
reasons  to  prefer  short  term  optimization  over  long  term.  Naturally,  it  is  not  favored  in 
financial  applications  where  money  spent  now  is  worth  more  than  money  spent  later,  but 
there  are  situations  (communication  networks  being  a  prime  example)  where  a  ‘steady  state’ 
operation  is  expected  over  intervals  long  compared  to  the  time  constants  of  the  system.  Then 
it  makes  sense  to  minimize  the  limiting  time-averaged  cost,  i.e.,  the  ‘average  cost.’ 

Mathematically,  the  criterion  stands  out  as  being  much  more  difficult  to  analyze  than 
others;  while  other  classical  criteria  lead  to  reasonably  complete  solutions,  the  average  cost 
does  not.  The  finite  state  and  action  problem  is  well  understood,  but  there  are  numerous 
counterexamples  in  which  infinite  state  or  action  problems  do  not  have  a  nice  solution.  In 
fact,  it  appears  not  as  a  single  problem  but  a  collection  of  problems,  some  of  which  do  not 
have  a  nice  solution  (cf.  [146]).  Thus,  a  variety  of  approaches  have  been  developed  to  handle 
different  situations.  Not  surprisingly,  this  is  one  chapter  of  Markov  decision  theory  which 
is  anything  but  closed.  At  the  same  time,  it  has  come  of  age,  having  been  studied  for  over 
30 years, with promises of significant advances on the horizon. This, in short, is the raison
d’être for this survey; we have attempted to put together a coherent account of what has
been  done,  with  an  indication  of  what  future  advances  may  be. 

Any  such  project  has  obvious  limitations.  Space  constraints  dictate  a  certain  amount 
of  selection,  and  not  every  relevant  work  can  be  covered  in  significant  detail.  We  have 
included  proofs  where  we  felt  they  were  essential  to  understanding  the  results  or  contained 
potentially  useful  novel  ideas.  In  all  cases,  a  serious  attempt  at  objectivity  has  been  made. 
For  other  complementary  reading  on  the  general  subject  of  Markov  decision  theory,  see 
[133],  [177],  [192],  [203]. 

The  paper  is  organized  as  follows:  Section  2  describes  the  problem  formulation  in  full 
detail.  Section  3  gives  a  brief  history.  Sections  4,  5  and  6  treat  extensively  the  finite 
state,  the  countable  state  and  the  Borel  state  space  cases  respectively,  under  complete 
observations. Section 7 treats the problem under partial observations. Section 8 describes
some recent results on multiobjective average cost control. Finally, we conclude with some
relevant  remarks. 

§2.  Preliminaries  and  Formulation  of  the  Problem 

In  this  section,  the  model  and  basic  results  concerning  controlled  Markov  processes  are 
given  in  the  most  general  form  needed  for  our  presentation.  In  some  subsequent  sections, 
we  specialize  our  presentation  to  situations  in  which  measure-theoretic  aspects  are  of  no 
essential  concern,  as  in  the  case  for  models  with  countable  state  space,  allowing  for  a  more 
transparent  exposition.  Before  presenting  the  model  we  summarize  our  key  notation: 

• $\mathbb{R}$: set of real numbers.

• $\mathbb{N}$: set of positive integers.

• $\mathbb{N}_0$: set of nonnegative integers.

• $\mathcal{B}(W)$: Borel $\sigma$-algebra of a given topological space $W$.

• $\mathcal{P}(W)$: for a Borel space $W$ (see [15]), the set of all probability measures on $\mathcal{B}(W)$,
endowed with the topology of weak convergence (see [130]).

Function spaces on a topological space $W$:

• $C_b(W) := \{v : W \to \mathbb{R} \mid v \text{ is continuous and bounded}\}$.

• $M(W) := \{v : W \to \mathbb{R} \mid v \text{ is Borel measurable}\}$.

• $M_b(W) := \{v : W \to \mathbb{R} \mid v \text{ is Borel measurable and bounded}\}$.

• $\mathcal{L}(W) := \{v : W \to \mathbb{R} \mid v \text{ is lower semicontinuous and bounded below}\}$.

• $\mathcal{L}_b(W) := \mathcal{L}(W) \cap M_b(W)$.

For $v \in M_b(W)$ we let:

• $\|v\| := \sup_{w \in W} |v(w)|$.

• $\mathrm{span}(v) := \sup_{w, w' \in W} \{v(w) - v(w')\}$.

• $v^+ := v - \inf_{w \in W} v(w)$, $v^- := v - \sup_{w \in W} v(w)$.

We refer to $\mathrm{span}(v)$ as the span seminorm of $v$.
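As a small illustration of this notation, the following sketch evaluates the sup norm, the span seminorm, and the shifted functions $v^+$ and $v^-$ when $W$ is a finite set, so that each supremum and infimum is a maximum or minimum. The function names and the sample values are assumptions made for the example only.

```python
def sup_norm(v):
    """||v|| := sup_w |v(w)|, for v given as a list of values on a finite W."""
    return max(abs(x) for x in v)


def span(v):
    """span(v) := sup_{w,w'} (v(w) - v(w')), i.e. max(v) - min(v) on a finite W."""
    return max(v) - min(v)


def v_plus(v):
    """v^+ := v - inf_w v(w); the result is nonnegative with minimum 0."""
    lo = min(v)
    return [x - lo for x in v]


def v_minus(v):
    """v^- := v - sup_w v(w); the result is nonpositive with maximum 0."""
    hi = max(v)
    return [x - hi for x in v]


v = [3.0, -1.0, 2.0]               # hypothetical values v(w) on a three-point W
shifted = [x + 10.0 for x in v]    # v plus a constant

print(sup_norm(v), span(v))
print(v_plus(v), v_minus(v))
print(span(shifted) == span(v))    # the span ignores additive constants
```

The span vanishes on constant functions, which is why it is only a seminorm; this is the property that makes it convenient for quantities defined up to an additive constant.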

The following is a list of the acronyms used in this paper (the subsection where each
acronym is first introduced is indicated in parentheses):

•  AC  average  cost  (Sect.  2.4) 

•  ACOE  average  cost  optimality  equation  (Sect.  3) 

•  ACOI  average  cost  optimality  inequality  (Sect.  5.2) 

•  CMP  controlled  Markov  process  (Sect.  2.1) 

•  CO  completely  observable  (Sect.  3) 

•  DC  discounted  cost  (Sect.  2.4)

•  DCOE  discounted  cost  optimality  equation  (Sect.  2.6)

•  PO  partially  observable  (Sect.  7.2) 

•  POCMP  partially  observable  controlled  Markov  process  (Sect.  3) 

•  TC  total  cost  (Sect.  2.4) 

2.1. The model. A discrete time, stationary controlled Markov process (CMP), or Markov
decision process, is a stochastic dynamical system specified by the five-tuple $(S, A, U, P, c)$,
where:

(a) $S$ is a Borel space, called the state space, the elements of which are called states.

(b) $A$ is a Borel space, called the action or control space.

(c) $U : S \to \mathcal{B}(A)$ is a strict, measurable, compact-valued multifunction (see the
Appendix). $U(x)$ represents the set of admissible actions (or control inputs) when the
system is in state $x \in S$. Accordingly, the set of admissible state/action pairs is
$K := \{(x,a) : x \in S,\ a \in U(x)\} = \mathrm{Graph}(U)$, and we have that $K \in \mathcal{B}(S \times A)$.
This set is endowed with the subspace topology corresponding to $\mathcal{B}(S \times A)$.

(d) $P$ is a stochastic kernel on $S$ given $K$, called the transition kernel. It is assumed
to be Borel measurable, i.e., $P(D \mid \cdot) : K \to [0,1]$ is Borel measurable, for each
$D \in \mathcal{B}(S)$.

(e) $c : K \to \mathbb{R}$ is the (measurable) one-stage cost function.

The evolution of the system is as follows. Let $X_t$ denote the state at time $t \in \mathbb{N}_0$, and $A_t$
the action chosen at that time. If $X_t = x \in S$ and $A_t = a \in U(x)$, then: (i) a cost $c(x,a)$
is incurred, and (ii) the system moves to the next state $X_{t+1}$, according to the probability
distribution $P(\cdot \mid x,a)$. Once the transition into the next state has occurred, a new action
is chosen, and the process is repeated.

The total period of time over which the system is to be observed is called the planning
(or decision-making or control) horizon and is denoted by $T$. It can be a finite interval
$\{0, \dots, N-1\}$, with $N \in \mathbb{N}$, or an infinite horizon, e.g. $T = \mathbb{N}_0$.

Example 2.1. Let $S$, $A$, $W$ be Borel spaces, and $F : S \times A \times W \to S$ a Borel function.
Consider a nonlinear stochastic system described by the system equation

$$X_{t+1} = F(X_t, A_t, W_t), \quad t \in T,$$

where the process $\{W_t\}$ is a sequence of independent and identically distributed (i.i.d.)
$W$-valued random variables, with common probability distribution $\mathcal{P}_W$, often referred to as a
stochastic state disturbance, or noise; $\{W_t\}$ is assumed to be independent of $X_0$. Suppose
that a strict, measurable, compact-valued multifunction $U : S \to \mathcal{B}(A)$ has been specified,
giving the necessary constraints on the control actions, or that $U(x) = A$, for all $x \in S$,
if there are no constraints. Then the evolution of the system is equivalently described in
terms of the stochastic kernel $P$ on $S$ given $K$ defined as

$$P(D \mid x,a) := \int_W I\{F(x,a,w) \in D\}\, \mathcal{P}_W(dw), \quad (x,a) \in K,\ D \in \mathcal{B}(S),$$

where $I\{\mathcal{A}\}$ denotes the indicator function of the event $\mathcal{A}$. The additional specification of
a measurable cost function $c : K \to \mathbb{R}$ would completely define a CMP $(S, A, U, P, c)$.
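To make the passage from a system equation to a transition kernel concrete, here is a minimal sketch for a noise variable taking finitely many values, in which case the integral defining $P(D \mid x,a)$ becomes a finite sum. The storage-type dynamics `F`, the capacity, and the noise distribution are all hypothetical choices, not taken from the survey.

```python
from fractions import Fraction

CAPACITY = 3  # hypothetical bound on the state
# Hypothetical noise law P_W on W = {0, 1, 2}.
NOISE_DIST = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def F(x, a, w):
    """Hypothetical system function: add a (up to CAPACITY), then remove w."""
    return max(0, min(CAPACITY, x + a) - w)


def kernel(x, a):
    """P(. | x, a) as a dict {next_state: probability}, summing over the noise."""
    dist = {}
    for w, p in NOISE_DIST.items():
        y = F(x, a, w)
        dist[y] = dist.get(y, Fraction(0)) + p
    return dist


def prob(D, x, a):
    """P(D | x, a) = sum over w of 1{F(x, a, w) in D} * P_W(w)."""
    return sum(p for y, p in kernel(x, a).items() if y in D)


print(kernel(1, 1))
print(prob({0, 1}, 1, 1))
```

Exact rational arithmetic is used so that each `kernel(x, a)` sums to one without rounding error; for continuous noise the sum would be replaced by numerical integration or Monte Carlo sampling.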

Example 2.2. Consider a countable set $S$ endowed with the discrete topology. With no loss
in generality we can take $S = \mathbb{N}_0$. Let $A$ be a Borel space and $U(x) = A$, for all $x \in S$. In
this case, every stochastic kernel on $\mathbb{N}_0$ given $K := \mathbb{N}_0 \times A$ reduces to a collection of discrete
probability distributions parameterized by $(i,a) \in K$. These can also be represented by
a collection of stochastic matrices $\{P(a) = [p_{ij}(a)]\}_{a \in A}$, i.e., $P(a)$ is a state transition
matrix, and $p_{ij}(a)$ is the probability that the state of the system makes a transition from
$i$ to $j$, under action $a$. Therefore, additionally specifying a cost function $c : \mathbb{N}_0 \times A \to \mathbb{R}$
completely defines a CMP.
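A finite truncation of this representation is easy to write down. The sketch below stores one stochastic matrix per action together with a cost table and checks that every row is a probability distribution; the two-state, two-action numbers are hypothetical and serve only as an illustration.

```python
# Hypothetical model: P[a][i][j] = p_ij(a) and C[i][a] = c(i, a).
P = {
    0: [[0.9, 0.1], [0.4, 0.6]],
    1: [[0.2, 0.8], [0.5, 0.5]],
}
C = [[1.0, 2.0], [3.0, 0.5]]


def is_stochastic(matrix, tol=1e-12):
    """True if every row is nonnegative and sums to one (within tol)."""
    return all(
        all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def transition_prob(i, j, a):
    """Probability of moving from state i to state j under action a."""
    return P[a][i][j]


print(all(is_stochastic(P[a]) for a in P))
print(transition_prob(0, 1, 1))
```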

The (admissible) history spaces are defined as

$$H_0 := S, \qquad H_t := H_{t-1} \times K, \quad t \in \mathbb{N},$$

and the canonical sample space is defined as

$$\Omega := (S \times A)^\infty.$$

These spaces are endowed with their respective product topologies, and are therefore Borel
spaces. A generic element $\omega \in \Omega$ is of the form $\omega = (x_0, a_0, x_1, a_1, \dots)$, $x_i \in S$, $a_i \in A$; all
random variables will be defined on the measurable space $(\Omega, \mathcal{B}(\Omega))$.

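The state, action (or control) and information processes, denoted by $\{X_t\}_{t \in T}$, $\{A_t\}_{t \in T}$
and $\{H_t\}_{t \in T}$, respectively, are defined by the projections

$$X_t(\omega) := x_t, \quad A_t(\omega) := a_t, \quad H_t(\omega) := (x_0, \dots, a_{t-1}, x_t), \quad t \in T,$$

for each realization $\omega = (x_0, \dots, a_{t-1}, x_t, a_t, \dots) \in \Omega$. Since $\mathcal{B}(\Omega) = (\mathcal{B}(S) \times \mathcal{B}(A))^\infty$, the
above are well-defined random processes on $(\Omega, \mathcal{B}(\Omega))$. Note that $\mathcal{B}(\Omega) = \bigvee_{t \in \mathbb{N}_0} \mathcal{F}_t$, where
$\mathcal{F}_t$ is the $\sigma$-algebra generated by $H_t$.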

2.2. Policies and performance criteria. An admissible control strategy, or policy, is a
sequence $\pi = \{\pi_t\}_{t \in T}$ of Borel measurable stochastic kernels on $A$ given $H_t$, satisfying the
constraint

$$\pi_t(U(x_t) \mid h_t) = 1, \quad x_t \in S,\ h_t \in H_t.$$

The set of all admissible policies will be denoted by $\Pi$.

If $\mu \in \mathcal{P}(S)$ and $\pi \in \Pi$ are given, there exists a unique probability measure $\mathcal{P}_\mu^\pi$ on
$(\Omega, \mathcal{B}(\Omega))$ satisfying the following [15, Prop. 7.28, pp. 140-144], [126, Prop. V.1.1, pp. 162-164],
with $D \in \mathcal{B}(S)$ and $C \in \mathcal{B}(A)$,

(2.1)  $\mathcal{P}_\mu^\pi(X_0 \in D) = \mu(D)$,

(2.2)  $\mathcal{P}_\mu^\pi(A_t \in C \mid H_t) = \pi_t(C \mid H_t)$,  $\mathcal{P}_\mu^\pi$-a.s.,

(2.3)  $\mathcal{P}_\mu^\pi(X_{t+1} \in D \mid H_t, A_t) = P(D \mid X_t, A_t)$,  $\mathcal{P}_\mu^\pi$-a.s.

Therefore, if $\mu$ is the distribution of the initial state $X_0$, and policy $\pi \in \Pi$ is used, the
underlying probability space of all random variables of interest is $(\Omega, \mathcal{B}(\Omega), \mathcal{P}_\mu^\pi)$. The
expectation operator with respect to $\mathcal{P}_\mu^\pi$ will be denoted by $E_\mu^\pi$. Furthermore, if $\mu$ is a
Dirac measure at $x \in S$, we will simply write $\mathcal{P}_x^\pi$ and $E_x^\pi$.

Certain classes of admissible policies are of special interest. A policy $\pi$ is called a Markov
randomized policy if there exists a sequence of measurable maps $\{f_t\}_{t \in T}$, called randomized
decision rules, where $f_t : S \to \mathcal{P}(A)$, for each $t \in T$, such that

$$\pi_t(\cdot \mid H_t) = f_t(X_t)(\cdot), \quad \mathcal{P}_\mu^\pi\text{-a.s.}$$

Conversely, every sequence of measurable maps $f_t : S \to \mathcal{P}(A)$, $t \in T$, satisfying
$f_t(x)(U(x)) = 1$, defines a Markov randomized policy in an obvious way; with some abuse
in notation the sequence itself will be referred to as the policy. The set of all Markov
randomized policies will be denoted by $\Pi_{MR}$. A policy $\{f_t\}_{t \in T} \in \Pi_{MR}$ is called a stationary
randomized policy if there is a randomized decision rule $f$ such that, for all $t \in T$, $f_t = f$.
The set of all stationary randomized policies will be denoted by $\Pi_{SR}$. A nonrandomized,
deterministic or pure decision rule is a measurable map $f : S \to A$. A policy $\{f_t\}_{t \in T} \in \Pi_{MR}$
is called a nonrandomized, deterministic or pure Markov policy if each $f_t$ is deterministic.
Hence, in this case $A_t = f_t(X_t)$ a.s. The set of deterministic Markov policies will be denoted
by $\Pi_{MD}$. Stationary deterministic policies are defined in the obvious way. The set of all
stationary deterministic policies is denoted by $\Pi_{SD}$, and for $\pi \in \Pi_{SD}$, $\pi(x)$ will denote the
action chosen at $x \in S$. Clearly $\Pi_{SD} \subset \Pi_{MD} \subset \Pi_{MR} \subset \Pi$, and $\Pi_{SD} \subset \Pi_{SR} \subset \Pi_{MR}$.

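It is easily seen that under a policy $\pi = \{f_t\}_{t \in T} \in \Pi_{MR}$, the state process $\{X_t\}_{t \in T}$ is a
Markov process. That is, for $D \in \mathcal{B}(S)$,

$$\mathcal{P}_\mu^\pi(X_{t+1} \in D \mid X_t, \dots, X_0) = \mathcal{P}_\mu^\pi(X_{t+1} \in D \mid X_t) = \int P(D \mid X_t, a)\, f_t(X_t)(da), \quad \mathcal{P}_\mu^\pi\text{-a.s.},$$

and under a policy $\pi' \in \Pi_{SR}$, $\{X_t\}_{t \in T}$ is a Markov process with stationary transition
probabilities.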
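For finite state and action sets the integral above reduces to a weighted sum of the action-indexed transition matrices. The following sketch builds the transition matrix of the state process under a stationary randomized rule $f$; the model data and the rule itself are hypothetical.

```python
# Hypothetical two-state, two-action model: P[a][i][j] = p_ij(a).
P = {
    0: [[0.9, 0.1], [0.4, 0.6]],
    1: [[0.2, 0.8], [0.5, 0.5]],
}

# Hypothetical stationary randomized rule: f[i] maps each action to its probability.
f = [
    {0: 0.5, 1: 0.5},   # in state 0, randomize evenly between the two actions
    {0: 1.0},           # in state 1, always use action 0
]


def induced_matrix(P, f):
    """P_f[i][j] = sum over a of f(i)(a) * p_ij(a)."""
    n = len(f)
    return [
        [sum(q * P[a][i][j] for a, q in f[i].items()) for j in range(n)]
        for i in range(n)
    ]


P_f = induced_matrix(P, f)
for row in P_f:
    print(row)
```

Each row of `P_f` is a convex combination of rows of stochastic matrices, so `P_f` is again stochastic, and it does not depend on $t$ because the rule is stationary.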

Each policy $\pi \in \Pi$ incurs a stream of random costs, e.g. $\{c(X_t, f_t(X_t))\}_{t \in T}$, for
$\{f_t\}_{t \in T} \in \Pi_{MD}$. Depending upon the problem requirements several cost evaluation
criteria are studied. The following criteria are frequently used.


Total Cost (TC). The total cost incurred by the policy $\pi \in \Pi$ over the entire planning
horizon is given by

$$J_T(\mu,\pi) := E_\mu^\pi \Big[ \sum_{t \in T} c(X_t, A_t) \Big].$$

When the horizon is finite, i.e., $T = \{0, \dots, N-1\}$, $N \in \mathbb{N}_0$, we denote the above more
explicitly as $J_N(\mu,\pi)$. Furthermore, given a terminal cost function $h \in M_b(S)$, we define

$$J_N(\mu,\pi,h) := E_\mu^\pi \Big[ \sum_{t=0}^{N-1} c(X_t, A_t) + h(X_N) \Big].$$


Discounted Cost (DC). Let $0 < \beta < 1$, the discount factor, and $\pi \in \Pi$ be given. The
total discounted cost incurred by $\pi$ over the infinite planning horizon is given by

$$J_\beta(\mu,\pi) := E_\mu^\pi \Big[ \sum_{t=0}^{\infty} \beta^t c(X_t, A_t) \Big].$$


Average Cost (AC). The expected long-run average cost incurred by $\pi \in \Pi$ is given by

$$J(\mu,\pi) := \limsup_{N \to \infty} E_\mu^\pi \Big[ \frac{1}{N} \sum_{t=0}^{N-1} c(X_t, A_t) \Big] = \limsup_{N \to \infty} \frac{1}{N} J_N(\mu,\pi).$$

Sample Path Average Cost. This is a pathwise version of the AC and, for $X_0 = x$, it
is given by

$$J_S(x,\pi) := \limsup_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} c(X_t, A_t),$$

where $\{X_t\}$ and $\{A_t\}$ are the state and control processes induced by $\pi \in \Pi$. Here, $J_S(x,\pi)$ is
to be regarded as an extended real-valued random variable on the canonical sample space.


For the AC criterion, the limit of the expected average cost may not exist for some or
all policies $\pi \in \Pi$, and thus the limit superior is used. This is always well-defined and
captures the worst possible asymptotic expected average performance under policy $\pi \in \Pi$,
i.e., it gives a ‘pessimistic’ measure of performance. On the other hand, the limit inferior
could also be used, which would yield an ‘optimistic’ measure of performance by capturing
the best possible asymptotic expected average performance. The planning horizon for the
TC criterion can be finite or infinite, whereas for the other criteria above it is always
infinite. Under certain conditions, it can be shown that a problem with the DC criterion is
equivalent to one with a TC criterion, with a random (finite) horizon, see [39, pp. 31-32].
Also, it can be shown that for each $\pi \in \Pi$, a policy $\pi' \in \Pi_{MR}$ can be found such that
$E_\mu^\pi[c(X_t, A_t)] = E_\mu^{\pi'}[c(X_t, A_t)]$, for each $t \in \mathbb{N}_0$ and any initial distribution $\mu \in \mathcal{P}(S)$ [41],
[50, Sect. 3.8]. Thus, for criteria that are determined by these expected costs, such as the
AC, DC, and TC criteria, it suffices to consider policies in $\Pi_{MR}$.
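Since the TC, DC and AC criteria depend on a policy only through the expected stage costs $E_\mu^\pi[c(X_t, A_t)]$, all three can be evaluated from that one sequence. The sketch below does this for a fixed stationary policy on a finite state space, propagating the state distribution forward in time; the chain `P_f`, the cost vector `c_f`, and the truncation horizon are hypothetical, and the infinite-horizon quantities are only approximated by truncation.

```python
def expected_stage_costs(P_f, c_f, mu, horizon):
    """Return [E c(X_t, A_t) for t = 0..horizon-1] under a fixed stationary policy.

    P_f[i][j] is the induced transition matrix, c_f[i] the induced stage cost,
    and mu the initial distribution (all hypothetical inputs).
    """
    n = len(mu)
    dist = list(mu)
    costs = []
    for _ in range(horizon):
        costs.append(sum(dist[i] * c_f[i] for i in range(n)))
        dist = [sum(dist[i] * P_f[i][j] for i in range(n)) for j in range(n)]
    return costs


def total_cost(costs):
    """Finite-horizon total cost J_N."""
    return sum(costs)


def discounted_cost(costs, beta):
    """Truncated approximation of the discounted cost J_beta."""
    return sum(beta ** t * ct for t, ct in enumerate(costs))


def average_cost(costs):
    """J_N / N, whose limsup over N is the average cost."""
    return sum(costs) / len(costs)


P_f = [[0.5, 0.5], [0.5, 0.5]]
c_f = [0.0, 2.0]
costs = expected_stage_costs(P_f, c_f, mu=[1.0, 0.0], horizon=1000)
print(total_cost(costs), discounted_cost(costs, 0.9), average_cost(costs))
```

In this example the total cost grows linearly with the horizon, while the discounted and time-averaged costs stay bounded, which is the reason the last two are the usual choices over an infinite horizon.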

For an infinite planning horizon, $J_T(\mu,\pi)$ need not be well-defined or may be infinite for
all $\pi \in \Pi$, rendering this criterion useless for comparing the performance under different
policies.  Therefore,  the  DC  or  AC  criteria  are  usually  selected  when  the  planning  horizon 
is  infinite.  When  the  DC  criterion  is  used,  a  rather  complete  theory  is  available  for  the 
corresponding  dynamic  programming  formulation  of  the  problem  [14],  [15],  [50],  [80],  [100], 
[146], [196]. In this situation, future costs are discounted at a fixed rate $0 < \beta < 1$,
and therefore, if $\beta$ is not sufficiently close to 1, the asymptotic behavior of the state/cost
process  may  not  be  important  at  all.  Quite  the  opposite  is  the  case  with  the  AC  criterion, 
under  which  all  decision  times  are  given  equal  weight  and  one  takes  the  limit  of  time- 
averaged  expected  costs.  The  finite  time  evolution  of  the  state/cost  process  is,  in  some 
sense,  completely  irrelevant  in  this  case,  and  some  sort  of  asymptotic  stable  behavior  is 
desired,  making  this  case  mathematically  much  more  involved  than  the  previous  one.  Hence, 
the  DC  and  AC  can  be  seen  as  two  opposite  extremes  in  the  spectrum  of  possible  criteria 
that  can  be  considered,  in  the  sense  that  the  first  one  captures  primarily  the  performance 
of  the  process  at  the  present  and  near  future,  and  the  second  captures  the  performance  at 
the  distant  future. 

2.3. The optimal control problem. The optimal control (or decision) problem is that
of selecting an admissible policy such that a given performance criterion is minimized over
all admissible policies. For example, for the DC criterion, a policy $\pi^* \in \Pi$ is said to be
($\beta$)-discount $\varepsilon$-optimal for the initial distribution $\mu$ if

$$J_\beta(\mu, \pi^*) \le J_\beta(\mu, \pi) + \varepsilon, \quad \forall \pi \in \Pi,$$

where $\varepsilon > 0$. If a policy is discount $\varepsilon$-optimal for all distributions $\mu \in \mathcal{P}(S)$, then it is
simply called discount $\varepsilon$-optimal. If a policy is discount $\varepsilon$-optimal for all $\varepsilon > 0$, then it is
called discount optimal. The (optimal) value function is given by

(2.4)  $J_\beta^*(\mu) := \inf_{\pi \in \Pi} J_\beta(\mu, \pi)$.

Also, if $\mu$ is concentrated at $x \in S$, we denote the value function by $J_\beta^*(x)$. Similar definitions
apply to other criteria; $J_T^*(\mu)$ and $J^*(\mu)$ will denote the optimal value functions for the TC
and AC criteria, respectively. For sample path AC, we define an optimal policy as follows:
We say that a policy $\pi^* \in \Pi$ is sample path AC optimal (or a.s. AC optimal) if there exists
a constant $\rho^*$ such that for any initial law $\mu$

$$J_S(\mu, \pi^*) = \rho^*, \quad \mathcal{P}_\mu^{\pi^*}\text{-a.s.},$$

while for any other policy $\pi \in \Pi$ and any initial law $\mu'$

$$J_S(\mu', \pi) \ge \rho^*, \quad \mathcal{P}_{\mu'}^{\pi}\text{-a.s.}$$

The constant $\rho^*$ is the sample path optimal average cost.

Having defined various optimality criteria, and the set of admissible policies $\Pi$, the
obvious  question  at  this  point  is:  do  there  exist  optimal  policies?  Without  imposing  further 
assumptions  on  our  general  model,  the  answer  is  no.  One  of  the  reasons  behind  this  is  that 
the  Borel  measurability  assumption  in  the  definition  of  admissible  policies  is  too  restrictive, 
in  general,  to  be  able  to  attain  the  infimum  in  (2.4).  To  circumvent  this  problem,  either  a 
broader  sense  of  measurability  is  allowed,  i.e.,  a  larger  set  of  admissible  policies  is  used,  or 
further  assumptions  are  imposed.  The  first  approach  was  taken  by  Shreve  and  Bertsekas 
[15],  [160],  [161],  who  considered  universally  measurable  policies,  a  class  properly  containing 
the  (Borel  measurable)  admissible  policies  defined  previously;  see  also  [50].  We  will  instead 
follow  the  second  approach  mentioned  above  and  concentrate  on  the  semicontinuous  model, 
as studied in [15], [46], [50], [69], [86], [119], [148]-[150].

2.4. The semicontinuous model. In general, we consider the case when the one-stage
cost function $c(\cdot,\cdot)$ is unbounded. Since for the most part the criteria considered in this
paper are given by a sum of expected costs over the infinite horizon, in order to avoid
indeterminate situations the following conditions will be assumed to hold throughout the
rest of the paper, unless otherwise indicated.

(A2.1) $c(x,a) \ge 0$, for all $(x,a) \in K$.

(A2.2) The transition kernel $P(\cdot \mid x,a)$ is weakly continuous in $(x,a)$; that is, $v(\cdot) \in C_b(S)$
implies $\int_S v(y) P(dy \mid \cdot, \cdot) \in C_b(K)$.

(A2.3) (i) The multifunction $U(x)$ is upper semicontinuous;

(ii) $c(\cdot,\cdot) \in \mathcal{L}(K)$.

Remark 2.1. Concerning (A2.1), note that we only need to assume the cost is bounded
below. The assumption that the cost is nonnegative is only made for convenience and does
not result in any loss of generality. Assumption (A2.2) is equivalent to
$\int_S v(y) P(dy \mid \cdot, \cdot) \in \mathcal{L}(K)$, for each $v(\cdot) \in \mathcal{L}(S)$ [50, p. 52]. This property is crucial in our development.

Example 2.3. For the nonlinear stochastic system in Example 2.1, assume further that:

(i) $A$ is compact,

(ii) for each $x \in S$, $U(x)$ is closed (and therefore compact), and

(iii) the system function $F : K \times W \to S$ is continuous.

If $c(\cdot,\cdot) \in \mathcal{L}(K)$, then, by Remark 2.1, (A2.2) will hold. Furthermore, the assumption on
the compactness of $A$ can be dispensed with if there are compact subsets $K_1 \subset K_2 \subset \cdots$
in $S \times A$, such that $K = \bigcup_{n \in \mathbb{N}} K_n$ and

$$\lim_{n \to \infty} \inf \{ c(x,a) : (x,a) \in K_n \setminus K_{n-1} \} = +\infty,$$

since in this case $A$ can be conveniently compactified, cf. [15, Cor. 8.6.1, p. 210]. Also, the
case in which $S = \mathbb{R}^n$, $A = \mathbb{R}^m$, and $c(x,a) = x'Qx + a'Ra$, where $Q$ and $R$ are positive
semidefinite and positive definite matrices, respectively, of appropriate dimensions, can
be considered by a (one point) compactification of $A$ [160, pp. 965-966].

Under (A2.1)-(A2.3), the undiscounted dynamic programming map $T$ given by

(2.5)  $T(v)(x) := \inf_{a \in U(x)} \Big\{ c(x,a) + \int_S v(y) P(dy \mid x,a) \Big\}, \quad \forall x \in S,$

maps $\mathcal{L}(S)$ into itself. Also, for $0 < \beta < 1$, the discounted dynamic programming map
$T_\beta : \mathcal{L}(S) \to \mathcal{L}(S)$ is given by

(2.6)  $T_\beta(v) := T(\beta v)$.

The  following  properties  are  easily  verified. 

Lemma 2.1. Let $v, v' \in \mathcal{L}(S)$. Then

(i) for all $k \in \mathbb{R}$, $T(v + k) = T(v) + k$;

(ii) if $v \le v'$, then $T(v) \le T(v')$.
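Both properties can be checked numerically on a finite model, where the infimum in (2.5) is a minimum over finitely many actions and the integral is a sum. The sketch below is only a sanity check on one hypothetical two-state, two-action model with $U(x) = A$; it is not a proof, and the numbers are invented for the example.

```python
# Hypothetical finite model: C[x][a] = c(x, a), P[a][x][y] = P(y | x, a).
C = [[1.0, 2.0], [3.0, 0.5]]
P = [
    [[0.9, 0.1], [0.4, 0.6]],   # action 0
    [[0.2, 0.8], [0.5, 0.5]],   # action 1
]
STATES = range(2)
ACTIONS = range(2)


def T(v):
    """Undiscounted dynamic programming map (2.5) on the finite model."""
    return [
        min(C[x][a] + sum(P[a][x][y] * v[y] for y in STATES) for a in ACTIONS)
        for x in STATES
    ]


def shift_invariant(v, k, tol=1e-12):
    """Check property (i): T(v + k) = T(v) + k."""
    lhs = T([x + k for x in v])
    return all(abs(a - (b + k)) <= tol for a, b in zip(lhs, T(v)))


def monotone(v, w, tol=1e-12):
    """Check property (ii) for a pair with v <= w pointwise."""
    assert all(a <= b for a, b in zip(v, w))
    return all(a <= b + tol for a, b in zip(T(v), T(w)))


print(T([0.0, 4.0]))
print(shift_invariant([0.0, 4.0], 2.5), monotone([0.0, 1.0], [0.5, 4.0]))
```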

Some  key  results  for  the  stochastic  control  problem  under  a  DC  criterion  are  summarized 
in  the  following  theorem. 

Theorem 2.1. Under the hypotheses (A2.1)-(A2.3),

(i) the following equation, which is called the discounted cost optimality equation (DCOE),
holds

(2.7)  $J_\beta^*(x) = T_\beta(J_\beta^*)(x) = \inf_{a \in U(x)} \Big\{ c(x,a) + \beta \int_S J_\beta^*(y) P(dy \mid x,a) \Big\}, \quad x \in S$;

(ii) a policy $\pi^* \in \Pi_{SD}$ is discount optimal if and only if $\pi^*(x)$ attains the infimum in
(2.7), for all $x \in S$;

(iii) a discount optimal policy $\pi^* \in \Pi_{SD}$ exists;

(iv) define $T_\beta^0 : \mathcal{L}(S) \to \mathcal{L}(S)$ as the identity operator and $T_\beta^k : \mathcal{L}(S) \to \mathcal{L}(S)$, $k \in \mathbb{N}$,
by $T_\beta^k(f) := T_\beta(T_\beta^{k-1}(f))$. Then for any $f \in \mathcal{L}_b(S)$

$$T_\beta^k(f)(x) \to J_\beta^*(x) \ \text{ as } k \to \infty, \quad \text{for all } x \in S;$$

(v) $J_\beta^*(\cdot)$ is nonnegative and lower semicontinuous.

Remark 2.2. The above results are essentially contained in [15], [50]. The usual approach is
to prove the existence of a solution to the DCOE via a contraction mapping theorem [80].
The existence of a measurable selector which attains the infimum in (2.7), e.g. the result in
(iii) of Theorem 2.1, follows from [15, Prop. 7.33, p. 153], [29], [46, pp. 35-38], [50, Sect. 2.6],
[86], [135, Th. 4.1, p. 9], [180, Th. 9.1, p. 880]. The scheme used in (iv) of Theorem 2.1 to
compute $J_\beta^*(\cdot)$ is called the value iteration (or successive approximations) algorithm. Note
that $J_\beta^*(\cdot)$ is not necessarily the only fixed point of $T_\beta$; however, $J_\beta^*(\cdot)$ is the minimal fixed
point of $T_\beta$ among the class of nonnegative functions in $\mathcal{L}(S)$ [15, Chap. 5].
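The value iteration scheme in part (iv) of Theorem 2.1 can be sketched for a finite model, where costs are bounded and $T_\beta$ is a contraction in the sup norm, so the iterates converge geometrically from any bounded starting function. The model data, the stopping tolerance, and the helper names below are assumptions made for illustration; the sketch also extracts a stationary deterministic policy that attains the minimum in the DCOE, in the spirit of part (ii).

```python
# Hypothetical finite model: C[x][a] = c(x, a), P[a][x][y] = P(y | x, a).
C = [[1.0, 2.0], [3.0, 0.5]]
P = [
    [[0.9, 0.1], [0.4, 0.6]],   # action 0
    [[0.2, 0.8], [0.5, 0.5]],   # action 1
]
STATES = range(2)
ACTIONS = range(2)


def q_value(v, x, a, beta):
    """One-stage cost plus discounted expected continuation value."""
    return C[x][a] + beta * sum(P[a][x][y] * v[y] for y in STATES)


def T_beta(v, beta):
    """Discounted dynamic programming map T_beta(v) = T(beta * v)."""
    return [min(q_value(v, x, a, beta) for a in ACTIONS) for x in STATES]


def value_iteration(beta, tol=1e-10, max_iter=10_000):
    """Iterate v <- T_beta(v) from v = 0 until successive iterates are within tol."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iter):
        new = T_beta(v, beta)
        if max(abs(a - b) for a, b in zip(new, v)) < tol:
            return new
        v = new
    return v


def greedy_policy(v, beta):
    """Stationary deterministic policy attaining the minimum in the DCOE for v."""
    return [min(ACTIONS, key=lambda a: q_value(v, x, a, beta)) for x in STATES]


J = value_iteration(beta=0.9)
print(J, greedy_policy(J, 0.9))
```

The stopping rule bounds the distance between successive iterates, not the distance to the fixed point; for a contraction with modulus $\beta$ the latter is larger by a factor of at most $\beta/(1-\beta)$.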

§3.  A  Sketch  of  Historical  Development 

We  now  present  a  brief  historical  sketch  of  the  development  of  CMP,  with  an  emphasis 
on  the  average  cost  criterion.  The  roots  of  CMP  can  be  traced  back  to  the  pioneering  work 
of  Wald  [182],  [183]  on  sequential  analysis  and  statistical  decision  functions.  In  the  late 
1940’s  and  early  1950’s,  several  investigators  formulated  the  essential  concepts  of  CMP, 
which  are  found  in  their  work  in  sequential  game  models.  A  CMP  can  be  viewed  as  a 
one-player  game.  Of  particular  interest  is  the  work  of  Bellman  and  Blackwell  [12],  Bellman 
and  LaSalle  [13],  and  also  Shapley,  who  formulated  the  essential  mechanism  of  stochastic 
dynamic  programming  and  used  the  theory  of  contraction  mappings  [156].  Using  his  famous 
heuristic  ‘minimum  cost  to  go,’  Bellman  showed  how  powerful  the  dynamic  programming 
technique was, by using it to solve problems in a myriad of settings [9]-[11]. Bellman
studied  mostly  problems  with  a  finite  horizon,  for  which  the  backward  induction  approach 
of  dynamic  programming  suffices  to  give  a  complete  treatment.  The  situation  is  quite 
different  in  problems  over  an  infinite  horizon.  Early  work  on  CMP  is  also  reported  in 
econometrics  [4],  [48]. 

Howard  [92]  was  apparently  the  first  to  study  CMP  with  an  average  cost  criterion.  His 
policy  iteration  algorithm  was  the  first  major  computational  breakthrough,  and  his  book 
helped  establish  CMP  as  an  independent  subject  of  investigation.  For  CMP  with  finite  state 
and  action  spaces,  Howard’s  policy  iteration  scheme  established  the  existence  of  a  stationary 
deterministic  policy,  optimal  in  this  class  only.  Derman  [37]  and  Viskov  and  Shiryaev 
[179]  independently  showed  that  this  policy  was  optimal  among  all  admissible  policies. 
Other  computational  methods  were  later  proposed.  Manne  [121]  gave  a  linear  programming 
formulation  for  the  AC  criterion,  and  Wagner  [181]  later  characterized  extreme-point  optima 
of  the  linear  program  as  stationary  deterministic  policies.  White  [193]  introduced  the  value 
iteration  (or  successive  approximations)  technique.  Excellent  accounts  of  these  and  other 
computational methods are given in [14, Sect. 5.2] and [133].

On the theoretical side, Blackwell’s seminal paper [18] gave considerable impetus to
research  in  this  area,  motivating  numerous  other  papers.  In  [18]  Blackwell  studied  CMP 
with  finite  state  and  action  spaces.  He  considered  the  DC  criterion  in  great  detail,  and 
established  many  important  results.  In  the  same  paper,  he  initiated  an  approach  for  the 
AC  case  which  we  will  refer  to  as  the  vanishing  discount  approach:  he  treated  the  AC  case 
as  a  limit  of  the  DC  case,  when  the  discount  factor  goes  to  1,  i.e.,  the  discounting  effect 
vanishes.  Blackwell  established  in  [18]  the  existence  of  a  stationary  deterministic  policy 
which is discount optimal, for all $\beta$ sufficiently close to one. This type of optimality is
now  called  Blackwell  optimality  [14,  pp.  336-341].  The  relation  between  the  discounted  and 
average  case  also  becomes  apparent  via  Tauberian  theorems  [85,  Sect.  4.6].  This  fact  seems 
to  have  been  observed  first  by  Gillette  [77],  who  used  Tauberian  theorems  to  establish  the 
existence  of  optimal  stationary  policies  in  a  stochastic  game  problem  with  an  AC  criterion. 
Also,  using  Tauberian  theorems,  Derman  [37]  showed  that  the  Blackwell  optimal  policy 
found  in  [18]  was  also  optimal  for  the  AC  criterion.  Average  cost  CMP  with  finite  state  and 
arbitrary action spaces were studied under various conditions in the works of [35], [55]-[57], [97].

Blackwell optimal policies do not necessarily exist when the state space is countably
infinite [118]. In fact, average optimal policies need not exist in this situation [117], [146].
A similar non-existence result holds when the state space is finite but the action space is an
arbitrary compact metric space [8]. For such models the existence of an optimal policy has
been proved by Bather [8], Martin-Löf [122] and Feinberg [56], under certain conditions.
Derman [38] studied the problem with countable state space, finite action space and bounded
cost. He studied the following equation which became known as the average cost optimality
equation (ACOE)

$$\rho + h(i) = \min_{a \in U(i)} \Big\{ c(i,a) + \sum_{j \in S} P(j \mid i,a)\, h(j) \Big\},$$

where $\rho$ is a scalar, $h : S \to \mathbb{R}$, $S = \mathbb{N}_0$, and we write $P(j \mid \cdot,\cdot)$ for $P(\{j\} \mid \cdot,\cdot)$.
He showed that if the ACOE has a bounded solution, i.e., a solution $(\rho, h)$ with $h(\cdot)$ a
bounded function, then the stationary deterministic policy realizing the pointwise minimum
on the right-hand side of the ACOE is average optimal, and $\rho$ is the minimum average
cost. Derman’s paper, in conjunction with Derman-Veinott [42], showed that a sufficient
condition for the existence of such a solution was that the expected hitting time of a fixed
state under any stationary deterministic policy is bounded uniformly with respect to the
choice of the policy and the initial state. Motivated by Blackwell’s work, Taylor [173]
extended the vanishing discount approach to obtain a bounded solution for a Markovian
sequential replacement problem by studying the asymptotics of the differential discounted
value function $h_\beta(\cdot) := J_\beta^*(\cdot) - J_\beta^*(0)$. Ross [143], [144] refined Taylor’s procedure and showed
that, under the Derman-Veinott [42] condition, $\{h_\beta(\cdot)\}_{\beta \in (0,1)}$ was uniformly bounded in $\beta$.
By letting $\beta \uparrow 1$ Ross established that the ACOE had a bounded solution. This made the
vanishing discount approach very popular. In subsequent works many variants of
Derman-Veinott recurrence conditions appeared. See [51], [174] for a great variety of such conditions.
These conditions are difficult to remove, and counterexamples abound [146]. Actually, it has
been shown in [62], in a very general setting, that the uniform boundedness of $\{h_\beta(\cdot)\}_{\beta \in (0,1)}$
in $\beta$ is also a necessary condition for a bounded solution to the ACOE to exist.
Cavazos-Cadena [30], [31], under some additional conditions, showed that the existence of bounded
solutions to the ACOE necessarily imposes a very strong recurrence structure on the model.
Lippman [112] studied controlled semi-Markov processes with unbounded cost with both
discounted and average cost criteria. Following the vanishing discount approach he derived
results for the average cost case under several restrictive assumptions. Federgruen et al.
[52] have explored the same approach.

Hordijk  [89]  extended  many  earlier  results  to  countable  state  space  and  compact  action 
spaces.  He  introduced  the  Lyapunov  function  method  for  CMP.  He  used  this  method  to 
obtain a (possibly unbounded) solution to the ACOE, yielding an optimal policy. However,
the  Lyapunov  function  method  necessarily  imposes  a  blanket  stability  of  the  processes  (in 
the  sense  of  positive  recurrence).  Such  stability  is  not  always  met  in,  e.g.,  many  queueing 
model  applications.  In  addition,  he  introduced  some  new  concepts,  particularly  based  on 
the  relation  of  stochastic  dynamic  programming  with  Markov  potential  theory.  There  is  a 
vast  amount  of  literature  devoted  to  CMP  in  several  volumes  of  the  Mathematisch  Centrum 
tracts;  see  [177]  and  the  references  therein. 

With  Hordijk’s  work,  it  appeared  that  a  shift  away  from  the  vanishing  discount  approach 
was  necessary.  Rosberg-Varaiya-Walrand  [140]  treated  the  average  cost  criterion  as  the 
limiting  case  of  the  finite  horizon  problem,  but  details  of  their  arguments  depend  heavily  on 
the  specifics  of  the  problem  they  consider,  viz.  the  control  of  two  queues  in  tandem  with  a 
linear  cost  structure.  Federgruen  and  Tijms  [54]  initiated  a  direct  study  of  the  ACOE  by  a 
span  semi-norm  method,  for  bounded  costs.  This  method  allows  one  to  obtain  useful  value 
iteration  algorithms.  Later  Federgruen  et  al.  [53]  treated  the  problem  with  countable  state 
space  and  unbounded  costs.  Assuming  a  recurrence  condition  on  the  model,  they  established 
the  existence  of  a  (possibly  unbounded)  solution  to  the  ACOE,  thereby  establishing  the 
existence  of  an  optimal  stationary  deterministic  policy. 

In a series of papers [20]-[25] Borkar presented a convex analytic approach to treat
the  problem  with  countable  state  space,  compact  action  space  and  unbounded  cost.  This 
approach  can  be  seen  as  an  extension  of  the  ideas  in  Manne’s  [121]  and  Wagner’s  [181]  work. 
Borkar  stressed  the  existence  of  an  optimal  stable  stationary  deterministic  policy,  i.e.,  one 
that  induces  a  positive  recurrent  process.  While  a  blanket  stability  assumption  (e.g.  of 
Lyapunov  type)  may  be  too  restrictive  to  cover  many  queueing  applications,  it  nevertheless 
is  desirable  that  the  optimal  policy  be  stable.  Borkar  showed  that  to  obtain  an  optimal 
stable  stationary  deterministic  policy  either  a  blanket  stability  hypothesis  or  a  condition  on 
the  cost  that  penalizes  unstable  behavior  is  necessary.  He  also  emphasized  the  concept  of 
almost  sure  optimality  by  a  ‘pathwise’  treatment  of  the  problem.  A  comprehensive  account 
of  the  convex  analytic  approach  to  CMP  is  given  in  [26]. 

After  the  extensive  works  of  Hordijk,  Federgruen  et  al.,  and  Borkar,  it  seemed  that 
the  vanishing  discount  approach  was  not  appropriate  for  many  classes  of  problems  with 
unbounded costs. However, this approach has been revived and generalized to a great
extent in [17], [59], [61], [72], [74], [75], [81], [83], [151], [152], [163], [168], [186]. In some
of  these  references,  an  inequality  version  of  the  ACOE  is  studied.  In  view  of  the  results  of 
[30],  [31],  [62],  it  is  clear  that  a  bounded  solution  to  the  ACOE  is  too  restrictive,  in  general. 
A  natural  candidate  solution  is  one  that  is  bounded  below  [28],  [74],  [83],  [151],  [152], 
[168],  [186],  or  one  having  suitable  growth  properties  [28],  or  satisfying  other  conditions 
[163].  Weber  and  Stidham  [168],  [186],  studied  the  problem  for  queueing  systems.  Under 
a  penalizing  condition  on  the  cost  and  some  structural  assumptions,  they  established  the 
existence  of  a  (possibly  unbounded)  solution  to  the  ACOE,  and  showed  the  existence  of 
an  optimal  stationary  deterministic  policy.  Sennott  proceeded  along  similar  lines.  She 
identified  very  general  conditions  on  the  discounted  value  function  so  that  the  vanishing 
discount  approach  could  successfully  be  pursued.  We  refer  to  [151]-[153],  [168],  [186]  for 
many  interesting  examples  of  queueing  systems,  and  to  [34]  for  a  comparison  of  different 
sets of assumptions. Extensions of these techniques to semi-Markov decision processes with
applications  to  queueing  systems  have  been  reported  in  [153]. 

The first attempt to give a description of CMP with more general state and action spaces
was carried out by Karlin [95]. Blackwell [19], Maitra [119] and Strauch [169] studied CMP
with a general state space and the discounted cost criterion. Their work was significantly
extended by Shreve and Bertsekas in [15], [160], [161]. Feinberg [58] studied CMP with
Borel state space and with arbitrary numerical criteria, which include TC, AC, DC as
particular cases. By establishing the convexity of the set of strategic measures (the measures
induced by the policies on the canonical space) he established the existence of an $\varepsilon$-optimal
$f \in \Pi_{SD}$ for these criteria. De Leve [109], [110], [111] considered general state and action space CMP
in continuous time with an AC criterion, with an emphasis on the ergodic behavior of the
processes. Ross [144] used the vanishing discount approach to study CMP with an AC
criterion, general state space, finite action space, and bounded cost function. He showed
that if the family of differential discounted value functions $\{h_\beta(\cdot)\}_{\beta \in (0,1)}$ is equicontinuous
and uniformly bounded, then the ACOE admits a bounded solution, yielding an optimal
stationary deterministic policy. Ross also introduced the concept of minorant. He showed
that if there exist a state $x_0 \in S$ and $\alpha > 0$ such that

$$P(x_0 \mid x,a) \ge \alpha, \quad \text{for all } a \in U(x),\ x \in S,$$

then the average cost case could be reduced to a discounted one. This was greatly extended
in the work of Gubenko and Statland [78] (see also [42]). They showed that under similar
minorant (or majorant) conditions a contraction map, with respect to the sup norm, could
be defined on $M_b(S)$, which would yield a bounded solution to the ACOE. They also
obtained bounded solutions to the ACOE under continuity and boundedness conditions
which guarantee that $\{h_{\beta_n}(\cdot)\}$, with $\beta_n \uparrow 1$, is uniformly bounded and equicontinuous, and
thus a similar approach as in [144] can be followed. Georgin [70], [71] also explored this
approach, under some ergodicity conditions. Tijms [175] and Hübner [93] directly studied
the ACOE, under some ergodicity assumptions, by showing that the undiscounted dynamic
programming map is a contraction on $M_b(S)$ with respect to the span seminorm. For
an excellent presentation of these methods, and the type of ergodicity conditions used,
see [80, Sect. 3.3]. Wijngaard [197], [198] and Kumar [98] studied the problem under
Doeblin’s condition using an operator theoretic method. Under several conditions, Kurano
[101] obtained solutions to the ACOE and also showed the existence of an average optimal
stationary deterministic policy. Also, in [102]-[104], he obtained the existence of an optimal
stationary deterministic policy under Doeblin’s condition. For a comprehensive presentation
of the different recurrence conditions used for the above purposes, see [84].

The  study  of  partially  observable  controlled  Markov  processes  (POCMP)  was  initiated 
independently  by  various  authors  [5],  [45],  [49],  [157],  [158].  The  reduction  to  models  with 
complete  information  was  done  in  various  cases  in  [5],  [134],  [147],  [201].  The  study  of 
finite state space POCMP with an AC criterion was initiated by Sondik [166]. Transforming
the problem into an equivalent completely observable (CO) problem with Borel state
space, Sondik tried to cast the problem in the framework of Ross [144], but did not show
equicontinuity of $\{h_\beta(\cdot)\}$. Ross [146], Wang [185], and White [187] showed this for
specific, scalar replacement problems. Ohnishi et al. [128] studied a multi-state replacement
problem by using concavity properties of $h_\beta(\cdot)$. Platzman studied the general problem of
finite state and action space POCMP, also by using concavity properties of the functions
$h_\beta(\cdot)$. Under certain reachability conditions he proved that the family $\{h_\beta(\cdot)\}_{\beta \in (0,1)}$ is
uniformly bounded. However, even though this family may not be equicontinuous with
respect to the Euclidean metric, he showed that it is equi-Lipschitzian with respect to some
other appropriate metric, thus putting the problem within the framework of Ross [144].
Fernández-Gaucherand et al. [60], [61] followed a different approach to the problem, using
the concepts of invariant sets of a CMP and controlled sub-Markov processes. This approach
allows one to consider POCMP with countable state and observation spaces. Borkar
[26] also studied the problem via his convex analytic approach.


§4.  Finite  State  Space 

In  this  section  we  will  consider  models  with  a  finite  state  space.  Initially,  we  restrict 
our  attention  to  the  case  when  A  is  a  finite  set;  models  with  compact  action  space  will  be 
discussed  at  the  end  of  the  section. 

4.1. Finite action spaces. Let $S = \{1,\dots,k\}$. In this case $\Pi_{SD}$ is finite. This fact plays
a crucial role in the analysis for the average cost problem. For a policy $\pi \in \Pi$, let $J_\beta(\pi)$
denote the vector $(J_\beta(1,\pi),\dots,J_\beta(k,\pi))$; similarly we define $J_N(\pi)$, $J(\pi)$, $J^*_\beta$, $J^*$ and
$J^*_N$. For a stationary deterministic policy $f \in \Pi_{SD}$, let $P(f)$ denote the transition matrix
of the corresponding process and

$$c(f) := (c(1,f(1)),\dots,c(k,f(k)))^T.$$

Also, the $(i,j)$-th entry in the $n$-th power of the transition matrix $P(f)$ will be denoted by
$P^n_{ij}(f)$ or $P^n(f)(i,j)$. It is well known that

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} P^n(f) := P^*(f)$$

exists [18], [85, Chap. 4], [133], where $P^0(f) = I$ (the $k \times k$ identity matrix). Using the
theory of stochastic matrices the following results can be proved. For details see [8], [18],
[85], [133].

Theorem 4.1. For each $f \in \Pi_{SD}$,

(i) $J(f) = P^*(f)c(f)$.

(ii) The number of linearly independent equations in

$$(I - P(f))w = c(f) - J(f)$$

is $k$ minus the number of communicating classes in $P(f)$.

(iii) The equations

$$(I - P(f))w = c(f) - v \tag{4.1}$$

$$P^*(f)w = 0 \tag{4.2}$$

have solutions $v = J(f)$ and $w = w(f)$ where

$$w(f) := (I - P(f) + P^*(f))^{-1}(I - P^*(f))c(f).$$

(iv) $v = J(f)$ and $w = w(f)$ are the unique solutions to (4.1) and (4.2) for which $v(s) = v(s')$
if $s$ and $s'$ are in the same communicating class of $P(f)$, and $v(s) = J(s,f)$ if
state $s$ is transient in $P(f)$.
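
The formulas in Theorem 4.1 can be checked numerically on a small example. The following is a minimal sketch, assuming a hypothetical two-state chain under one fixed policy $f$; the matrix `P`, the cost vector `c` and all helper names are illustrative assumptions, not taken from the text. $P^*(f)$ is approximated directly by the Cesàro average, so the residuals of (4.1) and (4.2) are small but not exactly zero.

```python
# Sketch: numerical check of Theorem 4.1 for one fixed policy f.
# P and c are hypothetical data chosen for illustration only.

def mat_mul(A, B):
    return [[sum(A[i][m] * B[m][j] for m in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (fine for small matrices)."""
    n = len(A)
    M = [[float(x) for x in A[i]] + [1.0 if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

def cesaro_limit(P, N=20000):
    """Approximate P* by the Cesaro average (1/N) * sum_{n=0}^{N-1} P^n."""
    k = len(P)
    power = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]  # P^0 = I
    total = [[0.0] * k for _ in range(k)]
    for _ in range(N):
        total = [[total[i][j] + power[i][j] for j in range(k)] for i in range(k)]
        power = mat_mul(power, P)
    return [[x / N for x in row] for row in total]

P = [[0.9, 0.1],
     [0.6, 0.4]]            # hypothetical P(f): irreducible and aperiodic
c = [1.0, 4.0]              # hypothetical c(f)
k = len(P)
I = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]

P_star = cesaro_limit(P)
J = mat_vec(P_star, c)                                   # Theorem 4.1(i)
A = [[I[i][j] - P[i][j] + P_star[i][j] for j in range(k)] for i in range(k)]
I_minus_P_star = [[I[i][j] - P_star[i][j] for j in range(k)] for i in range(k)]
w = mat_vec(mat_mul(inverse(A), I_minus_P_star), c)      # w(f) of Theorem 4.1(iii)

# Residuals of (4.1) with v = J(f), and of (4.2).
lhs = mat_vec([[I[i][j] - P[i][j] for j in range(k)] for i in range(k)], w)
res_41 = [lhs[i] - (c[i] - J[i]) for i in range(k)]
res_42 = mat_vec(P_star, w)

print("J(f) =", J)
print("w(f) =", w)
print("residual of (4.1):", res_41)
print("residual of (4.2):", res_42)
```

Because the chain in this sketch is irreducible, both entries of `J` agree (up to the Cesàro approximation error), in line with Theorem 4.1(iv).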

Remark 4.1.

(a) It is easily seen from the above theorem that if under an $f \in \Pi_{SD}$, the process is
irreducible or unichain (see [85]), then $J(\cdot,f)$ is constant.

(b) The matrix

$$H(f) := (I - P(f) + P^*(f))^{-1}(I - P^*(f))$$

is called the deviation matrix. It plays a fundamental role in the analysis. For the
discounted case $J_\beta(f) = (I - \beta P(f))^{-1}c(f)$. Analogous results can be developed for
the average cost case using $H(f)$. The following result, due to Miller and Veinott
[123] and Lamond and Puterman [107], can be proved using the spectral theory of
stochastic matrices.

Theorem 4.2. Let $\beta \in [0,1)$ and $\lambda = (1-\beta)\beta^{-1}$. Let $f \in \Pi_{SD}$ and $\nu$ the eigenvalue of
$P(f)$ less than one with largest modulus. If $0 < \lambda < 1 - |\nu|$, then

$$(\lambda I + I - P(f))^{-1} = \lambda^{-1}P^*(f) + \sum_{n=0}^{\infty} (-\lambda)^n H^{n+1}(f) \tag{4.3}$$

and

$$J_\beta(f) = (1+\lambda)\Big[\lambda^{-1}P^*(f)c(f) + \sum_{n=0}^{\infty} (-\lambda)^n H^{n+1}(f)c(f)\Big]. \tag{4.4}$$

Remark 4.2.

(a) The quantity $h(f) := H(f)c(f)$ plays a crucial role in the analysis of the problem.
It is called the bias or transient cost. It can be easily seen from the Neumann series
expansion of $(I - P(f) + P^*(f))^{-1}$ [133] that for $s \in S$

$$h(f)(s) = E^f_s\Big[\sum_{t=0}^{\infty} \big(c(X_t,f(X_t)) - J(X_t,f)\big)\Big].$$

From the above representation $h(f)$ can be interpreted as the expected total cost
for a CMP with cost $c - J$. If $P(f)$ is aperiodic, the distribution of $X_t$ converges
to a limiting distribution, so eventually $c(X_t,f(X_t))$ and $J(X_t,f)$ will differ very
little. Thus, $h(f)$ can be thought of as the expected total cost ‘until convergence’ or
the expected total cost during the ‘transient’ phase of the evolution of the process
[133].

(b) Howard [92] has shown that

$$J_N(f) = NJ(f) + h(f) + o(1).$$

Therefore, as $N$ becomes large, for each $s \in S$, $J_N(f)(s)$ approaches a straight line
with slope $J(f)(s)$ and intercept $h(f)(s)$. When $J(f)(s)$ is constant, $J_N(f)(s) - J_N(f)(s')$
approaches $h(f)(s) - h(f)(s')$, so that $h(f)$ is the asymptotic relative difference of
starting the process in two states $s$ and $s'$. That is why $h(f)$ is often referred to as
the relative value.

(c) The expansion (4.4) extends Blackwell’s expansion [18].

(d) Using the expansion (4.4), the following important result is immediate.

Corollary 4.1. For $f \in \Pi_{SD}$, $J(f) = \lim_{\beta \uparrow 1} (1-\beta)J_\beta(f)$.
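
Corollary 4.1 and Howard's expansion in Remark 4.2(b) can be observed numerically. The sketch below is only an illustration on a hypothetical two-state chain (the data `P`, `c` and the helper names are assumptions, not from the text): it prints $(1-\beta)J_\beta(f)$ for $\beta$ approaching 1, and compares two approximations of the bias $h(f)$, namely $J_\beta(f) - J(f)/(1-\beta)$ for $\beta$ close to 1 (the leading terms of (4.4)) and $J_N(f) - NJ(f)$ for large $N$.

```python
# Sketch: Corollary 4.1 and Remark 4.2(b) on a hypothetical 2-state chain.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(x) for x in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def discounted_cost(P, c, beta):
    """J_beta(f) = (I - beta P(f))^{-1} c(f)."""
    k = len(P)
    A = [[(1.0 if i == j else 0.0) - beta * P[i][j] for j in range(k)]
         for i in range(k)]
    return solve(A, c)

def finite_horizon_cost(P, c, N):
    """J_N(f) computed from J_{n+1} = c + P J_n with J_0 = 0."""
    k = len(P)
    J = [0.0] * k
    for _ in range(N):
        J = [c[i] + sum(P[i][j] * J[j] for j in range(k)) for i in range(k)]
    return J

def stationary_distribution(P, iters=500):
    """Power iteration; adequate for an irreducible aperiodic chain."""
    k = len(P)
    eta = [1.0 / k] * k
    for _ in range(iters):
        eta = [sum(eta[i] * P[i][j] for i in range(k)) for j in range(k)]
    return eta

P = [[0.9, 0.1],
     [0.6, 0.4]]     # hypothetical P(f)
c = [1.0, 4.0]       # hypothetical c(f)

eta = stationary_distribution(P)
J_avg = sum(e * x for e, x in zip(eta, c))   # J(f), constant since the chain is irreducible

for beta in (0.9, 0.99, 0.999):
    print(beta, [(1 - beta) * x for x in discounted_cost(P, c, beta)])

scaled = [(1 - 0.999) * x for x in discounted_cost(P, c, 0.999)]
h_discount = [x - J_avg / (1 - 0.9999) for x in discounted_cost(P, c, 0.9999)]
h_horizon = [x - 200 * J_avg for x in finite_horizon_cost(P, c, 200)]
print("J(f)           =", J_avg)
print("h(f), discount =", h_discount)
print("h(f), horizon  =", h_horizon)
```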

Following  Blackwell  [18]  and  Derman  [39]  we  now  prove  the  following  existence  results. 

Theorem 4.3. There exists an $f \in \Pi_{SD}$ which is discount optimal for all $\beta$ sufficiently
close to 1 and is also optimal for the average cost criterion.

Proof. For each $f \in \Pi_{SD}$ and $s \in S$, $J_\beta(s,f)$ is obviously an analytic function of $\beta$. Let
$\{\beta_n\}$, $0 < \beta_n < 1$, be a sequence such that $\beta_n \uparrow 1$. For a fixed $n$, let $f_n \in \Pi_{SD}$ be $\beta_n$-discount
optimal (see Theorem 2.1). Since $\Pi_{SD}$ is a finite set, the sequence $\{f_n\}$ must contain at
least one $f^* \in \Pi_{SD}$ which occurs infinitely often. Let $\{\beta_{n_k}\}$ be a subsequence of $\{\beta_n\}$ such
that $\beta_{n_k} \uparrow 1$ and $f^* = f_{n_1} = f_{n_2} = \cdots$. Then for every $g \in \Pi$, $J_{\beta_{n_k}}(f^*) \le J_{\beta_{n_k}}(g)$. Since
all coordinates of $J_\beta(f^*)$ and $J_\beta(g)$ are analytic functions of $\beta$, it follows that

$$J_\beta(f^*) \le J_\beta(g)$$

for all $\beta$ near 1. Since this holds for all $g \in \Pi$, it follows that $f^*$ is $\beta$-discount optimal for
all $\beta$ near 1. We next show that $f^*$ is average optimal. Let $\pi \in \Pi$. Then

$$(1-\beta_{n_k})J_{\beta_{n_k}}(f^*) \le (1-\beta_{n_k})J_{\beta_{n_k}}(\pi), \quad k = 1,2,\dots.$$

Therefore, letting $k \to \infty$ and using Theorem 4.1 and a standard Tauberian theorem (Theorem
A.2 in the Appendix), it follows that

$$J(f^*) = \lim_{\beta \uparrow 1} (1-\beta)J_\beta(f^*) = \lim_{k \to \infty} (1-\beta_{n_k})J_{\beta_{n_k}}(f^*) \le \limsup_{k \to \infty} (1-\beta_{n_k})J_{\beta_{n_k}}(\pi) \le J(\pi)$$

and the proof is complete. □

We  now  briefly  mention  three  numerical  approaches.  For  details  we  refer  to  [14],  [133], 
[176],  etc.  Our  presentation  follows  [133]. 

Value iteration. We assume that under any $f \in \Pi_{SD}$, the corresponding chain is unichain
and aperiodic. For any positive integer $N$, the finite horizon value function satisfies the
equation

$$J^*_{N+1} = \min_{f \in \Pi_{SD}} \{c(f) + P(f)J^*_N\}. \tag{4.5}$$

The equation (4.5) can act as an iteration equation with $J^*_0 = 0$ as the initial condition. Let
$f_{N+1} \in \Pi_{SD}$ realize the minimum in (4.5). We can treat $\frac{1}{N}J^*_N$ and $f_N$ as our guesses for
$J^*$ and an average optimal policy. Then $J^*_N - NJ^*$ converges as $N \to \infty$. Also, there exists
an integer $N_0$ such that for any $N \ge N_0$, any $f \in \Pi_{SD}$ which attains the minimum in (4.5)
is average optimal. However, this property does not yield an error estimate and hence fails
to provide a stopping rule for the iteration scheme. To this end, with $h = (h(1),\dots,h(k))$,
we let

$$L(h) := \min_{x \in S} \{Th(x) - h(x)\}, \qquad U(h) := \max_{x \in S} \{Th(x) - h(x)\}.$$

It can be shown that [133]

$$\min_{x \in S} \{J^*_N(x) - J^*_{N-1}(x)\} \le J^* \le \max_{x \in S} \{J^*_N(x) - J^*_{N-1}(x)\}$$

and

$$L(J^*_{N-1}) \le L(J^*_N) \le J^* \le U(J^*_N) \le U(J^*_{N-1}).$$

Furthermore,

$$\lim_{N \to \infty} \{U(J^*_N) - L(J^*_N)\} = 0.$$

Thus, an average $\varepsilon$-optimal policy can be found by stopping the value iteration when

$$U(J^*_N) - L(J^*_N) < \varepsilon.$$
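
The stopping rule above can be sketched in a few lines. The example below is a minimal illustration, assuming a hypothetical two-state, two-action model (the dictionaries `P` and `c` and the function names are not from the text) and assuming that $T$ denotes the minimization operator on the right-hand side of (4.5). Since $TJ^*_{N-1} = J^*_N$, the quantities $L$ and $U$ are just the smallest and largest components of the successive difference $J^*_N - J^*_{N-1}$.

```python
# Sketch: value iteration (4.5) with the stopping rule U - L < eps.
# The two-state, two-action model below is hypothetical.
P = {0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
     1: {0: [0.2, 0.8], 1: [0.6, 0.4]}}   # P[s][a][s'] = transition probability
c = {0: {0: 1.0, 1: 0.5},
     1: {0: 3.0, 1: 4.0}}                 # c[s][a] = one-stage cost
states = sorted(P)

def bellman(J):
    """Apply the minimization in (4.5); return the new vector and the minimizers."""
    new_J, policy = [], []
    for s in states:
        values = {a: c[s][a] + sum(p * J[t] for t, p in zip(states, P[s][a]))
                  for a in P[s]}
        best = min(values, key=values.get)
        new_J.append(values[best])
        policy.append(best)
    return new_J, policy

def value_iteration(eps=1e-8, max_iter=100000):
    J = [0.0 for _ in states]                      # J*_0 = 0
    for n in range(1, max_iter + 1):
        TJ, policy = bellman(J)
        diffs = [x - y for x, y in zip(TJ, J)]     # T J - J, componentwise
        lower, upper = min(diffs), max(diffs)      # L(J) and U(J)
        J = TJ
        if upper - lower < eps:
            return lower, upper, policy, n
    raise RuntimeError("no convergence within max_iter")

lower, upper, policy, n_iter = value_iteration()
print("bounds on the optimal average cost:", lower, upper)
print("minimizing policy:", policy, "after", n_iter, "iterations")
```

In this sketch the returned interval `[lower, upper]` brackets the optimal average cost, and the minimizing actions at termination define the candidate policy.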

There are other variants of this approach, e.g., the relative value iteration [193] and the
span contraction method [54].

Linear programming. To simplify our presentation, we will assume that under any $f \in \Pi_{SR}$,
the corresponding process is irreducible. Let $P(f)$ denote the transition matrix of
the process and $\eta(f) \in \mathcal{P}(S)$ its invariant measure. Then for any $s \in S$, $J(s,f) = J(f)$, a
constant, and

$$J(f) = \sum_{s \in S} \sum_{a \in U(s)} c(s,a)f(s,a)\eta(f)(s).$$

Therefore, the average cost problem can be reduced to the following linear programming
problem:

$$\text{minimize} \quad \sum_{s \in S} \sum_{a \in U(s)} c(s,a)x(s,a) \tag{4.6a}$$

subject to

$$x(s,a) \ge 0, \quad s \in S,\ a \in U(s), \tag{4.6b}$$

$$\sum_{s \in S} \sum_{a \in U(s)} x(s,a) = 1, \tag{4.6c}$$

$$\sum_{a \in U(s)} x(s,a) = \sum_{s' \in S} \sum_{a \in U(s')} x(s',a)P(s \mid s',a), \quad s \in S. \tag{4.6d}$$

Under the irreducibility assumption the simplex method can be employed to find an optimal
stationary deterministic policy. This formulation is due to Manne [121].
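
To make the constraints (4.6b)-(4.6d) concrete, the following sketch builds, for a hypothetical two-state, two-action model, the point $x(s,a) = \eta(f)(s)\,1\{a = f(s)\}$ associated with each stationary deterministic policy $f$, checks that it is feasible, and evaluates the objective (4.6a). It does not call an LP solver and is not the simplex method; it only enumerates these particular feasible points. The data and helper names are assumptions made for the illustration.

```python
# Sketch: feasible points of (4.6) generated by stationary deterministic policies.
# The model is hypothetical; every policy induces an irreducible chain.
from itertools import product

P = {0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
     1: {0: [0.2, 0.8], 1: [0.6, 0.4]}}   # P[s][a][s'] = probability of moving s -> s'
c = {0: {0: 1.0, 1: 0.5},
     1: {0: 3.0, 1: 4.0}}                 # c[s][a]
states = sorted(P)
actions = [0, 1]

def stationary_distribution(f, iters=500):
    """Invariant measure of the chain induced by f, via power iteration."""
    eta = [1.0 / len(states)] * len(states)
    for _ in range(iters):
        eta = [sum(eta[s] * P[s][f[s]][t] for s in states) for t in states]
    return eta

def occupation_measure(f):
    """x(s, a) = eta(f)(s) if a = f(s), and 0 otherwise."""
    eta = stationary_distribution(f)
    return {(s, a): (eta[s] if a == f[s] else 0.0) for s in states for a in actions}

def is_feasible(x, tol=1e-9):
    nonneg = all(v >= 0.0 for v in x.values())                          # (4.6b)
    normalized = abs(sum(x.values()) - 1.0) <= tol                      # (4.6c)
    balanced = all(
        abs(sum(x[s, a] for a in actions)
            - sum(x[t, a] * P[t][a][s] for t in states for a in actions)) <= tol
        for s in states)                                                # (4.6d)
    return nonneg and normalized and balanced

def objective(x):
    return sum(c[s][a] * x[s, a] for s in states for a in actions)      # (4.6a)

points = {f: occupation_measure(f) for f in product(actions, repeat=len(states))}
for f, x in points.items():
    print(f, "feasible:", is_feasible(x), "objective:", objective(x))

best_f = min(points, key=lambda f: objective(points[f]))
best_value = objective(points[best_f])
print("best deterministic policy:", best_f, "with average cost", best_value)
```

The objective value at each such point equals the average cost $J(f)$ of the corresponding policy, which is the link between the linear program and the control problem.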

Policy improvement. We work under the irreducibility assumption. The dual to the
linear program (4.6a)-(4.6d) is the problem of finding variables $g$ and $h(s)$, $s \in S$, in order
to

$$\text{maximize} \quad g \tag{4.7a}$$

subject to

$$g + \sum_{s' \in S} \big(\delta(s,s') - P(s' \mid s,a)\big)h(s') \le c(s,a), \quad (s,a) \in S \times U(s), \tag{4.7b}$$

where $\delta(s,s')$ is the Kronecker delta.
The functional equation

$$g + h(s) = \min_{a \in U(s)} \Big\{ c(s,a) + \sum_{s' \in S} P(s' \mid s,a)h(s') \Big\} \tag{4.8}$$

is equivalent to (4.7a)-(4.7b) under the irreducibility assumption and is the average cost
optimality equation [85]. We will discuss this equation in detail in the next section. It will
be shown that an $f \in \Pi_{SD}$ is optimal if and only if $f$ realizes the pointwise minimum in (4.8)
and then $g$ is the optimal average cost. This suggests the following iteration algorithm.

(i) Let $n = 1$. Choose $f_n \in \Pi_{SD}$. Let $h_n(s) = 0$ for all $s \in S$.

(ii) Find a solution $g_n$ and $h_n(s)$ of the following equation:

$$g_n + h_n(s) = c(s,f_n(s)) + \sum_{s' \in S} P(s' \mid s,f_n(s))h_n(s').$$

(iii) For each $s \in S$, compute

$$\phi_n(s) = \min_{a \in U(s) \setminus \{f_n(s)\}} \Big\{ c(s,a) + \sum_{s' \in S} P(s' \mid s,a)h_n(s') - g_n - h_n(s) \Big\}.$$

If $\phi_n(s) \ge 0$ for all $s \in S$, then $f_n$ is average optimal and $g_n$ is the optimal average
cost. If $\phi_n(s) < 0$ for some $s \in S$, then pick $a \in U(s)$ such that

$$c(s,a) + \sum_{s' \in S} P(s' \mid s,a)h_n(s') - g_n - h_n(s) < 0.$$

Define $f_{n+1} \in \Pi_{SD}$ as $f_{n+1}(s) = a$ and $f_{n+1}(\cdot) = f_n(\cdot)$ otherwise. Then $f_{n+1}$ yields
a lower average cost. Since $\Pi_{SD}$ is finite, the policy improvement scheme converges
in a finite number of steps.
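
Steps (i)-(iii) can be sketched as follows for a hypothetical two-state, two-action model in which every policy induces an irreducible chain. The data, the normalization $h_n(0) = 0$ used to single out one solution in step (ii), and the small tolerance used to decide $\phi_n(s) < 0$ are all assumptions of this illustration rather than part of the algorithm as stated.

```python
# Sketch: policy improvement, steps (i)-(iii), on a hypothetical model.
P = {0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
     1: {0: [0.2, 0.8], 1: [0.6, 0.4]}}   # P[s][a][s']
c = {0: {0: 1.0, 1: 0.5},
     1: {0: 3.0, 1: 4.0}}                 # c[s][a]
states = sorted(P)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[float(x) for x in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def evaluate(f):
    """Step (ii): solve g + h(s) = c(s,f(s)) + sum_s' P(s'|s,f(s)) h(s'), with h(0) = 0."""
    # Unknowns are (g, h(1), ..., h(k-1)); the column of h(0) is dropped.
    A = [[1.0] + [(1.0 if t == s else 0.0) - P[s][f[s]][t] for t in states[1:]]
         for s in states]
    b = [c[s][f[s]] for s in states]
    z = solve(A, b)
    return z[0], [0.0] + z[1:]

def improve(f, g, h, tol=1e-12):
    """Step (iii): change one state's action if some phi_n(s) < 0; else return None."""
    for s in states:
        phi = {a: c[s][a] + sum(P[s][a][t] * h[t] for t in states) - g - h[s]
               for a in P[s] if a != f[s]}
        if phi:
            a_best = min(phi, key=phi.get)
            if phi[a_best] < -tol:
                new_f = list(f)
                new_f[s] = a_best
                return new_f
    return None

f = [1, 0]                 # step (i): an arbitrary initial policy
history = []
while True:
    g, h = evaluate(f)
    history.append((list(f), g))
    new_f = improve(f, g, h)
    if new_f is None:
        break
    f = new_f

for policy, cost in history:
    print(policy, cost)
```

Each pass changes the action in a single state, so the printed average costs decrease strictly until no improving action remains.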

4.2. Compact action spaces. We now consider the problem where the action set $A$ is
not finite but a compact metric space. In this situation an optimal policy may not exist; see
[50, p. 178, Example 1]. Note that here $\Pi_{SD}$ is no longer finite. Under certain ergodicity
assumptions Martin-Löf [122] and Feinberg [55] have proved the existence of an optimal
$f \in \Pi_{SD}$. We will discuss various ergodicity assumptions on a countable state space in
detail in the next section. Here we focus on $\varepsilon$-optimal policies established by Chitashvili
[35] and Feinberg [56]; see [50, Chap. 7].

Theorem 4.4. Under (A2.1)-(A2.3), for every $\varepsilon > 0$ there exists an $\varepsilon$-optimal $f \in \Pi_{SD}$.

Proof (Sketch). For $f \in \Pi_{SD}$, let $J(f)$ be as in Theorem 4.1. For $i \in S$, let

$$J(i) = \inf_{f \in \Pi_{SD}} J(f)(i). \tag{4.9}$$

Clearly $J^*(i) \le J(i)$, for each $i \in S$. Corresponding to $i \in S$, select an $f_i \in \Pi_{SD}$ such that

$$J(f_i)(i) < J(i) + \varepsilon. \tag{4.10}$$

The set $A = \{f_i(j) : i,j \in S\}$ is obviously finite. Taking the action set to be $A$, the
preceding results can be applied to the finite CMP $(S,A,P,c)$. For this model there exists
a stationary deterministic policy, say $f^*$, which is average optimal. Thus

$$J(f^*)(i) \le J(f_i)(i) < J(i) + \varepsilon, \quad \text{for each } i \in S. \tag{4.11}$$

Let

$$\rho^*(i) := \limsup_{\beta \uparrow 1} (1-\beta)J^*_\beta(i).$$

Then by Theorem A.2

$$\rho^*(i) \le J^*(i), \quad \text{for each } i \in S. \tag{4.12}$$

By (4.11), it suffices to show that $J^*(i) = J(i)$, for each $i \in S$. From (4.12), it then suffices
to show that $\rho^*(i) \ge J(i)$. For each $\beta \in (0,1)$, let $f_\beta \in \Pi_{SD}$ be $\beta$-discount optimal. Let $f$
be a limit point of $f_\beta$ as $\beta \uparrow 1$. Then using (4.4) (which is valid in this case as well) and
(A2.1)-(A2.3), it can be shown that $\rho^*(i) \ge J(f)(i) \ge J(i)$. □

§5.  Countable  State  Space 

The  average  cost  problem  becomes  much  more  complicated  when  the  state  space  is 
countable.  Maitra  [117]  has  given  a  counterexample  which  shows  that  there  need  not  exist 
an  optimal  policy.  In  [118]  Maitra  has  studied  a  particular  problem  in  which  there  does 
not exist any policy which is $\beta$-discount optimal for all $\beta$ sufficiently close to 1. Flynn
[66] has constructed a more dramatic counterexample. In his example, there exists an
average optimal policy in $\Pi_{SD}$. Nevertheless he exhibits an $f \in \Pi_{SD}$ and a $\beta_0 \in (0,1)$
such that $f$ is $\beta$-discount optimal for all $\beta \in (\beta_0,1)$, but it is not average optimal. Fisher
and Ross [65] have presented a counterexample which shows that the optimal policy need
not be stationary or deterministic. We refer to [146] for several other counterexamples. It
is apparent that the average cost problem is closely related to the ergodic behavior of the
process and it is well known that the ergodic theory of Markov processes on a countable
state space is much more involved than on a finite state space; for example, a Markov
process on a finite state space cannot be null recurrent. Another vital difference in this
case is that the number of stationary deterministic policies is no longer finite. To study the
ergodic theory some recurrence conditions are necessary. There are many such conditions
available in the literature [26], [51], [174]; we will survey a few representative ones.

In what follows the state space $S = \{0,1,2,\dots\}$. For each $i \in S$, the action space $U(i)$
is a prescribed compact metric space. We will always assume that for fixed $i,j \in S$, $c(i,\cdot)$,
$P(i \mid j,\cdot)$, are continuous. These conditions can be weakened or dropped in several places,
as  will  be  clear  from  the  specific  context. 

Derman [37] studied the ACOE which, with $\rho$ a scalar and $h : S \to \mathbb{R}$, takes the form:

$$\rho + h(i) = \min_{a \in U(i)} \Big\{ c(i,a) + \sum_{j \in S} P(j \mid i,a)h(j) \Big\}. \tag{5.1}$$

A solution to (5.1) is a pair $(\rho,h)$ satisfying it.

Suppose $f \in \Pi_{SD}$ is a minimizing selector in (5.1). Then (5.1) becomes

$$\rho + h(i) = c(i,f(i)) + \sum_{j \in S} P(j \mid i,f(i))h(j). \tag{5.1'}$$

Equation (5.1') asserts that, apart from $\rho$, the cost if the process stops now equals the
expected cost if it continues under the policy $f$ for just one more period. We can give a
similar interpretation to (5.1). Hence, we may think that $\rho$ is the average cost under $f$ and
no other $f \in \Pi_{SD}$ has a smaller average cost. Thus, the function $h$ in (5.1) is roughly a
measure of how much we are prepared to pay to stop the process, though continuing to pay
an average cost $\rho$ in the future [137] (cf. Remark 4.2, (a)). Therefore, the function $h$ may
be viewed as a cost potential. Also, by a stochastic representation of $h$, using (5.1) and
(5.1'), $h$ is indeed a potential. Hordijk [89] has pursued this line of thought in great detail,
which we will discuss later.

We  start  with  a  characterization  of  optimal  policies. 

Theorem 5.1. If the ACOE has a solution $(\rho,h)$ satisfying

$$\lim_{t \to \infty} \frac{1}{t} E^\pi_i h(X_t) = 0, \quad \forall \pi \in \Pi_{SD},\ \forall i \in S, \tag{5.2}$$

then there exists an $f \in \Pi_{SD}$ such that

$$\rho = J(i,f) = J^*(i), \quad \forall i \in S.$$

Moreover, an $f \in \Pi_{SD}$ is average optimal if for each $i \in S$

$$c(i,f(i)) + \sum_{j \in S} P(j \mid i,f(i))h(j) = \min_{a \in U(i)} \Big\{ c(i,a) + \sum_{j \in S} P(j \mid i,a)h(j) \Big\}, \tag{5.3}$$

and, conversely, if an $f \in \Pi_{SD}$ is average optimal and the corresponding chain is irreducible
and positive recurrent then (5.3) holds.

Proof. Let $f \in \Pi_{SD}$ satisfy (5.3). Then since

$$E^f_i[h(X_{t+1}) \mid \mathcal{F}_t] = \sum_{j \in S} P(j \mid X_t,f(X_t))h(j),$$

it follows from (5.1) and (5.3) that

$$\rho + h(X_t) = c(X_t,f(X_t)) + E^f_i[h(X_{t+1}) \mid \mathcal{F}_t]. \tag{5.4}$$

Summing (5.4) from $t = 0$ to $N-1$, dividing by $N$ and taking expectations, we obtain

$$\rho = \frac{1}{N} E^f_i\Big[\sum_{t=0}^{N-1} c(X_t,f(X_t))\Big] + \frac{E^f_i[h(X_N)] - h(i)}{N}.$$

Next, letting $N \to \infty$ and using (5.2), yields

$$\rho = \lim_{N \to \infty} \frac{1}{N} E^f_i\Big[\sum_{t=0}^{N-1} c(X_t,f(X_t))\Big].$$

On the other hand if $\pi$ is any other policy, we can show using the same arguments that

$$\rho \le \limsup_{N \to \infty} \frac{1}{N} E^\pi_i\Big[\sum_{t=0}^{N-1} c(X_t,U_t)\Big].$$

Hence, $f$ is average optimal. Conversely, let $f \in \Pi_{SD}$ be average optimal and suppose that
the corresponding chain is irreducible and positive recurrent. If $f$ does not satisfy (5.3)
then there exist $i_0 \in S$, $a_0 \in U(i_0)$ and $\delta > 0$ such that

$$c(i_0,f(i_0)) + \sum_{j \in S} P(j \mid i_0,f(i_0))h(j) = c(i_0,a_0) + \sum_{j \in S} P(j \mid i_0,a_0)h(j) + \delta. \tag{5.5}$$

Let $f' \in \Pi_{SD}$ be defined as follows:

$$f'(i) = \begin{cases} f(i) & \text{if } i \ne i_0, \\ a_0 & \text{if } i = i_0. \end{cases}$$

Then, using (5.5) along with irreducibility and positive recurrence, it is easily seen that
$f'$ attains a strictly smaller average cost than $f$,
which contradicts the average optimality of $f$. □

Remark  5.1. 

(a) We say that (5.1) admits a bounded solution if $h(\cdot)$ is bounded. If the ACOE has
a bounded solution, then (5.2) is clearly satisfied; moreover, using the martingale
stability theorem [114, p. 53] it can be shown that the $f \in \Pi_{SD}$ selecting the
minimum in (5.3) is sample path average optimal [70].

(b) Various extensions of the last assertion of Theorem 5.3 have been obtained by Sennott
[154].


Derman  and  Veinott  [42]  have  prescribed  a  certain  recurrence  condition  which  ensures 
that  (5.1)  admits  a  bounded  solution.  We  will  discuss  it  later  in  this  section.  The  ACOE 
resembles  the  dynamic  programming  equation,  and  Theorem  5.1  is  analogous  to  a  dynamic 
programming  characterization  of  an  optimal  policy.  However,  the  dynamic  programming 
heuristic  does  not  lead  directly  to  the  ACOE.  Taylor  [173]  developed  a  vanishing  discount 
approach for a particular problem which was extended for the general case by Ross [143]-[146].
Our presentation here follows Ross [146]. As noted earlier, the average case can in
some sense be treated as the limiting case of the discounted problem as the discount factor
approaches one. The discounted value function $J_\beta(\cdot)$ satisfies the DCOE (cf. Theorem 2.1)

$$J_\beta(i) = \min_{a \in U(i)} \Big\{ c(i,a) + \beta \sum_{j \in S} P(j \mid i,a)J_\beta(j) \Big\}$$

and a $\beta$-discounted optimal policy selects a minimizing action. One possible way of finding
an average optimal policy might be to choose the actions minimizing

$$\lim_{\beta \to 1} \Big\{ c(i,a) + \beta \sum_{j \in S} P(j \mid i,a)J_\beta(j) \Big\}.$$

However, this limit need not exist and indeed would often be infinite for all actions. The
situation can nevertheless be salvaged by considering a ‘differential’ discounted value function,
i.e., $h_\beta(i) := J_\beta(i) - J_\beta(0)$, where $0 \in S$ is an arbitrary, fixed state. The function
$h_\beta(\cdot)$ satisfies

$$(1-\beta)J_\beta(0) + h_\beta(i) = \min_{a \in U(i)} \Big\{ c(i,a) + \beta \sum_{j \in S} P(j \mid i,a)h_\beta(j) \Big\}. \tag{5.6}$$

From (5.6) it is now apparent that (5.1) can be derived under certain conditions by letting
$\beta \to 1$. Indeed we have the following result [146].
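
For completeness, the step from the DCOE to (5.6) can be written out. It is a routine computation that uses only the definition $h_\beta(i) = J_\beta(i) - J_\beta(0)$ and the fact that $P(\cdot \mid i,a)$ sums to one over $S$:

```latex
\begin{align*}
h_\beta(i) &= J_\beta(i) - J_\beta(0) \\
 &= \min_{a \in U(i)} \Big\{ c(i,a) + \beta \sum_{j \in S} P(j \mid i,a)\big(h_\beta(j) + J_\beta(0)\big) \Big\} - J_\beta(0) \\
 &= \min_{a \in U(i)} \Big\{ c(i,a) + \beta \sum_{j \in S} P(j \mid i,a)h_\beta(j) \Big\} + \beta J_\beta(0) - J_\beta(0) \\
 &= \min_{a \in U(i)} \Big\{ c(i,a) + \beta \sum_{j \in S} P(j \mid i,a)h_\beta(j) \Big\} - (1-\beta)J_\beta(0).
\end{align*}
```

Moving $(1-\beta)J_\beta(0)$ to the left-hand side gives (5.6).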

Theorem 5.2. Suppose there exists a constant $K > 0$ such that $|h_\beta(i)| \le K$, for all $\beta \in (0,1)$ and $i \in S$. Then

(i) the ACOE admits a bounded solution $(\rho, h)$.

(ii) For some sequence $\beta_n \to 1$, $h(i) = \lim_{n \to \infty} h_{\beta_n}(i)$, $i \in S$.

(iii) $\lim_{\beta \to 1} (1-\beta) J_\beta(i) = \rho$ for any $i \in S$.

Proof. Let $\beta_n \uparrow 1$ be given. By the uniform boundedness of $h_\beta(\cdot)$, using a diagonalization procedure, we can find a subsequence, which for simplicity we also denote by $\beta_n$, such that $h_{\beta_n}(i) \to h(i)$ for each $i \in S$, where $h(\cdot)$ is a bounded function. Again, since $(1-\beta_n) J_{\beta_n}(0)$ is bounded, there is a further subsequence $\beta_{n_k} \uparrow 1$ such that

$$\rho := \lim_{k \to \infty} (1-\beta_{n_k})\, J_{\beta_{n_k}}(0)$$

exists. (i) then follows from (5.6) and an application of the dominated convergence theorem. Furthermore, by Theorem 5.1, $\rho$ is the minimum average cost. Since the above results are independent of the sequence chosen, (iii) follows. □
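The vanishing discount limit in Theorem 5.2 can be observed numerically on a small finite model. The following is a minimal sketch, not taken from the survey: the two-state, two-action model and all of its numbers are illustrative assumptions, and the helper names (`discounted_value`, `average_cost`, `acoe_residual`) are hypothetical. It computes $J_\beta$ exactly by enumerating deterministic policies, and then reports $(1-\beta)J_\beta(0)$, $h_\beta(1)$ and the ACOE residual of the pair $(\rho^*, h_\beta)$ as $\beta$ approaches one.

```python
from itertools import product

# Toy controlled Markov process; every number below is an illustrative assumption.
c = [[1.0, 3.0], [4.0, 2.0]]            # c[i][a]: one-stage cost
P = [[[0.9, 0.1], [0.2, 0.8]],          # P[i][a][j]: probability of i -> j under a
     [[0.5, 0.5], [0.7, 0.3]]]
POLICIES = list(product((0, 1), repeat=2))   # deterministic policies f = (f(0), f(1))


def discounted_value(beta):
    """J_beta, obtained as the componentwise minimum over deterministic policies."""
    best = [float("inf"), float("inf")]
    for f in POLICIES:
        # Solve (I - beta * P_f) J = c_f for two states by Cramer's rule.
        a, b = 1 - beta * P[0][f[0]][0], -beta * P[0][f[0]][1]
        d, e = -beta * P[1][f[1]][0], 1 - beta * P[1][f[1]][1]
        r0, r1 = c[0][f[0]], c[1][f[1]]
        det = a * e - b * d
        J = [(r0 * e - b * r1) / det, (a * r1 - d * r0) / det]
        best = [min(x, y) for x, y in zip(best, J)]
    return best


def average_cost(f):
    """Average cost of f from the closed-form invariant law of a two-state chain."""
    p01, p10 = P[0][f[0]][1], P[1][f[1]][0]
    eta = [p10 / (p01 + p10), p01 / (p01 + p10)]
    return eta[0] * c[0][f[0]] + eta[1] * c[1][f[1]]


def acoe_residual(rho, h):
    """Largest violation of rho + h(i) = min_a {c(i,a) + sum_j P(j|i,a) h(j)}."""
    return max(
        abs(rho + h[i] - min(c[i][a] + sum(P[i][a][j] * h[j] for j in (0, 1))
                             for a in (0, 1)))
        for i in (0, 1)
    )


rho_star = min(average_cost(f) for f in POLICIES)
for beta in (0.9, 0.99, 0.999, 0.9999):
    J = discounted_value(beta)
    h_beta = [0.0, J[1] - J[0]]          # h_beta(i) = J_beta(i) - J_beta(0)
    print(beta, (1 - beta) * J[0], h_beta[1], acoe_residual(rho_star, h_beta))
```

In this toy model all transition probabilities are positive, so every stationary policy is unichain and the hypothesis of Theorem 5.2 holds; the printed residual should shrink roughly in proportion to $1-\beta$.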

Remark 5.2. It has been shown [62] that if the ACOE has a bounded solution then there exists a constant $K > 0$ such that $|h_\beta(i)| \le K$ for all $\beta \in (0,1)$, $i \in S$.

5.1. Bounded costs. In this subsection we assume that $c(\cdot\,,\cdot)$ is bounded. Ross [146] has proved that under a Derman-Veinott [42] type recurrence condition (see (5.7) below), the uniform boundedness hypothesis of Theorem 5.2 is satisfied.

Theorem 5.3. Let $f \in \Pi_{SD}$ and let $\{X_t\}$ be the corresponding state process. Let

$$\tau = \min\{t \ge 1 : X_t = 0\}.$$

If there exists a $K > 0$ such that

$$E_i^f[\tau] \le K \tag{5.7}$$

for all $f \in \Pi_{SD}$ and all $i \in S$, then $h_\beta(i)$ is bounded uniformly in $\beta \in (0,1)$ and $i \in S$.
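Proof. Let $\beta \in (0,1)$ and let $f_\beta \in \Pi_{SD}$ be $\beta$-discount optimal. We have

$$
\begin{aligned}
J_\beta(i) &= E_i^{f_\beta}\Big[\sum_{t=0}^{\infty} \beta^t c(X_t, f_\beta(X_t))\Big] \\
&= E_i^{f_\beta}\Big[\sum_{t=0}^{\tau-1} \beta^t c(X_t, f_\beta(X_t))\Big] + E_i^{f_\beta}\Big[\sum_{t=\tau}^{\infty} \beta^t c(X_t, f_\beta(X_t))\Big] \\
&\le M\, E_i^{f_\beta}[\tau] + J_\beta(0)\, E_i^{f_\beta}[\beta^{\tau}],
\end{aligned}
\tag{5.8}
$$

where $M$ is a bound on $c(\cdot\,,\cdot)$. From (5.7) and (5.8) it follows that

$$J_\beta(i) - J_\beta(0) \le MK. \tag{5.9}$$

Also, from (5.8) and applying Jensen’s inequality, we get

$$J_\beta(i) \ge J_\beta(0)\, E_i^{f_\beta}[\beta^{\tau}] \ge J_\beta(0)\, \beta^{K}.$$

Therefore,

$$J_\beta(0) - J_\beta(i) \le (1-\beta^{K})\, J_\beta(0) \le (1-\beta^{K})\, \frac{M}{1-\beta} \le MK. \tag{5.10}$$

The desired result follows from (5.9) and (5.10). □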
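The bound $|h_\beta(i)| \le MK$ obtained from (5.9) and (5.10) can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the survey's computation: the two-state model and its numbers are made up, costs are taken in $[0, M]$, and the function names are hypothetical. The constant $K$ of (5.7) is computed in closed form for two states, and $J_\beta$ is approximated by a fixed number of value iteration sweeps.

```python
# Toy model (illustrative assumption): two states, two actions, costs in [0, M].
c = [[1.0, 3.0], [4.0, 2.0]]            # c[i][a]
P = [[[0.9, 0.1], [0.2, 0.8]],          # P[i][a][j]
     [[0.5, 0.5], [0.7, 0.3]]]
M = max(max(row) for row in c)


def value_iteration(beta, sweeps=20000):
    """Approximate J_beta by iterating the DCOE operator a fixed number of times."""
    J = [0.0, 0.0]
    for _ in range(sweeps):
        J = [min(c[i][a] + beta * sum(P[i][a][j] * J[j] for j in (0, 1))
                 for a in (0, 1)) for i in (0, 1)]
    return J


def return_time_bound():
    """K = max over deterministic f and states i of E_i^f[tau], tau = first return to 0."""
    K = 0.0
    for a0 in (0, 1):
        for a1 in (0, 1):
            from_1 = 1.0 / P[1][a1][0]            # geometric hitting time of 0 from 1
            from_0 = 1.0 + P[0][a0][1] * from_1   # one step, plus a detour through 1
            K = max(K, from_0, from_1)
    return K


K = return_time_bound()
gaps = {}
for beta in (0.5, 0.9, 0.99, 0.999):
    J = value_iteration(beta)
    gaps[beta] = J[1] - J[0]                      # h_beta(1); h_beta(0) = 0
    print(beta, gaps[beta], M * K)
```

The printed values of $h_\beta(1)$ should stay well inside $[-MK, MK]$ for every discount factor, which is the uniform boundedness that Theorem 5.2 requires.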

After the work of Derman [37], Derman-Veinott [42], and Ross [143], [144], several recurrence conditions have appeared [174]. We will look into a few representative ones.

Let $f \in \Pi_{SD}$. For a finite set $A \subset S$, let

$$\tau_A = \min\{t \ge 1 : X_t \in A\}. \tag{5.11}$$

(A5.1) There is a finite $A \subset S$ and a constant $K > 0$ such that

$$E_i^f[\tau_A] \le K$$

for all $i \in S$ and $f \in \Pi_{SD}$. Further, for any $f \in \Pi_{SD}$ the corresponding process does not have two disjoint invariant sets.

(A5.2) There exists a constant $K > 0$, and for every $f \in \Pi_{SD}$ there is a state $j(f) \in S$ such that

$$E_i^f[\tau_{\{j(f)\}}] \le K, \quad \forall\, i \in S.$$

(A5.3) (Simultaneous Doeblin) There is a finite set $A$, an integer $n \ge 1$ and a scalar $\alpha > 0$ such that

$$\sum_{j \in A} P^n(j \mid i, f(i)) \ge \alpha$$

for all $i \in S$ and all $f \in \Pi_{SD}$. Further, for any $f \in \Pi_{SD}$ the corresponding process does not have two disjoint invariant sets.

(A5.4) (Scrambling) There is an integer $n \ge 1$ and a scalar $\alpha > 0$ such that for any $f \in \Pi_{SD}$

$$\sum_{j} \min\{P^n_{i_1 j}(f),\, P^n_{i_2 j}(f)\} \ge \alpha, \quad \forall\, i_1, i_2 \in S.$$

(A5.5) (Ergodicity) There is an integer $n \ge 1$ and a scalar $\rho > 0$ such that for each $f \in \Pi_{SD}$ there exists an $\eta(f) \in \mathcal{P}(S)$ for which

$$\sum_{j} |P^m_{ij}(f) - \eta(f)(j)| \le 2(1-\rho)^{\lfloor m/n \rfloor}$$

for all $i \in S$ and $m \ge 1$, where $\lfloor x \rfloor$ denotes the largest integer not exceeding $x$.

Remark 5.3. Clearly (A5.1) and (A5.2) are generalizations of the Derman-Veinott condition. Hordijk [89] has proved the existence of a bounded solution to the ACOE using (A5.1). Under (A5.5), for each $f \in \Pi_{SD}$, $\eta(f)$ is the unique invariant measure of the corresponding process.

Federgruen et al. [51] have established the following.

Theorem 5.4. The conditions (A5.1)-(A5.3) are equivalent. Also, if for any $f \in \Pi_{SD}$ the corresponding process is aperiodic, then (A5.1)-(A5.5) are equivalent.

Remark 5.4. Under any one of the conditions (A5.1)-(A5.5), Federgruen et al. [51] have established the existence of a bounded solution to the ACOE by extending the vanishing discount approach of Taylor and Ross.
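The link between a scrambling-type condition and a geometric ergodicity bound can be illustrated for a single fixed policy. The sketch below is an assumption-laden illustration rather than the argument of [51]: the 3-state transition matrix is made up, only the case $n = 1$ is treated, and the envelope $2(1-\alpha)^m$ comes from the standard total-variation contraction argument, with the scrambling constant $\alpha$ playing the role of the rate constant in (A5.5).

```python
# A single stationary policy is fixed, so only its transition matrix matters.
# The matrix is an illustrative assumption.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
n_states = len(P)


def scrambling_coefficient(Q):
    """alpha = min over pairs of rows of sum_j min(Q[i1][j], Q[i2][j]) (case n = 1)."""
    return min(sum(min(x, y) for x, y in zip(Q[i1], Q[i2]))
               for i1 in range(n_states) for i2 in range(n_states))


def matmul(X, Y):
    """Plain matrix product for small square matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(n_states)) for j in range(n_states)]
            for i in range(n_states)]


alpha = scrambling_coefficient(P)

# Invariant law eta: any row of a high power of P (this chain is ergodic).
power = P
for _ in range(200):
    power = matmul(power, P)
eta = power[0]

# Compare sum_j |P^m(i,j) - eta(j)| with the geometric envelope 2 (1 - alpha)^m.
Pm = P
deviations = []
for m in range(1, 16):
    worst = max(sum(abs(Pm[i][j] - eta[j]) for j in range(n_states))
                for i in range(n_states))
    deviations.append((m, worst, 2 * (1 - alpha) ** m))
    Pm = matmul(Pm, P)

for row in deviations[:5]:
    print(row)
```

In the controlled setting the conditions are required uniformly over all $f \in \Pi_{SD}$, so the same computation would have to be repeated (or bounded) for every policy.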

We have thus far seen several recurrence conditions which are sufficient for the ACOE to admit a bounded solution. Cavazos-Cadena [30], [31] has dealt with the converse question of what are the necessary recurrence conditions for the ACOE to have a bounded solution. He has obtained the following result. Consider the condition:

(A5.6) There exists a constant $K > 0$ such that for each bounded and measurable $c : S \times A \to \mathbb{R}$ and every collection $\{U(i) : i \in S\}$, $U(i) \subset A$, there exist $\rho \in \mathbb{R}$ and $h : S \to \mathbb{R}$ bounded which solve (5.1) and satisfy $\|h\| \le K\|c\|$, where $\|\cdot\|$ is the sup norm.

Theorem 5.5. The conditions (A5.2) and (A5.6) are equivalent.

The proof follows by an application of the uniform boundedness principle. For details and other variants we refer to [30], [31]. Thus, an assumption on the existence of a bounded solution to the ACOE necessarily imposes a strong recurrence structure on the system. Also, note that (A5.6) involves not just one CMP but a family of CMP (one for each $c$ and $\{U(i)\}$). Since it is equivalent to (A5.1)-(A5.3) and under aperiodicity conditions to (A5.1)-(A5.5), it follows that (A5.1)-(A5.5) are too strong for many important applications. In fact there are interesting situations [20] in which these conditions are not satisfied but for which one can find average optimal stationary deterministic policies.

Ross [144] has proved that under the following recurrence condition the AC can be reduced to an appropriate DC. Therefore, in view of Theorem 2.1, the problem can be resolved in this case.

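Theorem 5.6. If there exists a constant $\alpha > 0$ such that

$$P(0 \mid i,a) \ge \alpha > 0$$

for all $i \in S$, $a \in U(i)$, then the AC can be reduced to an appropriate DC.

Proof. Let

$$
\tilde P(j \mid i, \cdot) =
\begin{cases}
(1-\alpha)^{-1} P(j \mid i, \cdot) & \text{for } j \ne 0, \\
(1-\alpha)^{-1} \big(P(0 \mid i, \cdot) - \alpha\big) & \text{for } j = 0.
\end{cases}
$$

Let $\tilde J_\beta(\cdot)$ denote the $\beta$-discounted value function for the CMP with cost $c(\cdot\,,\cdot)$ and transition law $\tilde P(\cdot \mid \cdot\,,\cdot)$. Then it is easily verified that for each $i \in S$

$$\alpha \tilde J_{1-\alpha}(0) + \tilde J_{1-\alpha}(i) = \min_{a \in U(i)} \Big\{ c(i,a) + \sum_{j \in S} P(j \mid i,a)\, \tilde J_{1-\alpha}(j) \Big\}.$$

Let $f \in \Pi_{SD}$ be $(1-\alpha)$-discount optimal for the modified CMP. It follows from Theorem 5.1 that $f$ is AC-optimal for the original CMP and the optimal average cost is $\alpha \tilde J_{1-\alpha}(0)$. □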
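The reduction in Theorem 5.6 is easy to test numerically. The following sketch uses an assumed two-state, two-action model (all numbers and function names are illustrative, not from [144]): it builds the modified transition law from the proof, solves the $(1-\alpha)$-discounted problem by value iteration, and compares $\alpha \tilde J_{1-\alpha}(0)$ with the optimal average cost obtained directly from invariant measures.

```python
# Toy model (illustrative assumption) with P(0 | i, a) >= alpha for every (i, a).
c = [[1.0, 3.0], [4.0, 2.0]]            # c[i][a]
P = [[[0.9, 0.1], [0.2, 0.8]],          # P[i][a][j]
     [[0.5, 0.5], [0.7, 0.3]]]
alpha = min(P[i][a][0] for i in (0, 1) for a in (0, 1))      # here 0.2

# Modified transition law from the proof: remove mass alpha at state 0, renormalize.
P_mod = [[[(P[i][a][j] - (alpha if j == 0 else 0.0)) / (1 - alpha) for j in (0, 1)]
          for a in (0, 1)] for i in (0, 1)]


def value_iteration(kernel, beta, sweeps=500):
    """Discounted value function for the given kernel by successive approximation."""
    J = [0.0, 0.0]
    for _ in range(sweeps):
        J = [min(c[i][a] + beta * sum(kernel[i][a][j] * J[j] for j in (0, 1))
                 for a in (0, 1)) for i in (0, 1)]
    return J


J_mod = value_iteration(P_mod, 1 - alpha)
rho_from_reduction = alpha * J_mod[0]


def average_cost(f):
    """Average cost of deterministic f under the ORIGINAL law (two-state closed form)."""
    p01, p10 = P[0][f[0]][1], P[1][f[1]][0]
    eta = [p10 / (p01 + p10), p01 / (p01 + p10)]
    return eta[0] * c[0][f[0]] + eta[1] * c[1][f[1]]


rho_direct = min(average_cost((a0, a1)) for a0 in (0, 1) for a1 in (0, 1))
print(rho_from_reduction, rho_direct)
```

Because the discount factor of the modified problem is $1-\alpha$, bounded away from one, the value iteration converges quickly; this is the practical appeal of the reduction when such an $\alpha$ exists.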

Remark 5.5. Note that if the ACOE has a bounded solution $(\rho, h)$ then $\rho$ is the optimal average cost for any initial condition. Hence, the existence of a bounded solution to the ACOE suggests that some kind of ‘unichainedness’ is in effect, since, for the multi-chain case the average cost would, in general, depend on the initial condition. The multi-chain version of the ACOE is

$$\min_{a \in U(i)} \sum_{j \in S} P(j \mid i,a)\, \rho(j) = \rho(i), \tag{5.12a}$$

$$\rho(i) + h(i) = \min_{a \in U_1(i)} \Big\{ c(i,a) + \sum_{j \in S} P(j \mid i,a)\, h(j) \Big\}, \tag{5.12b}$$

where

$$U_1(i) = \Big\{ a \in U(i) : \sum_{j} P(j \mid i,a)\, \rho(j) = \rho(i) \Big\}. \tag{5.12c}$$

This equation has been studied by Zijm [204] for countable state space. For more general state spaces it was extensively studied much earlier by Yushkevich [200] (see also [50]); this work will be discussed in the next section.

If (5.12) has a bounded solution $\rho(i)$, $h(i)$, where both $\rho$ and $h$ are bounded functions, then one can show as before that $\rho(i)$ is the optimal average cost starting from state $i \in S$ and a minimizing selector in (5.12) yields an average optimal stationary deterministic policy. Under a certain ‘geometric convergence condition’ Zijm [204] has established the existence of a bounded solution to (5.12). Under the additional assumption that under any stationary deterministic policy the corresponding process has at most a finite number of ergodic classes, he has shown that the geometric convergence condition is equivalent to a number of recurrence conditions of the type (A5.1)-(A5.5).

Hordijk [89] has established the existence of an average optimal $f \in \Pi_{SD}$ without utilizing the ACOE. Let $\Pi_{SD}$ be endowed with the product topology. Then $\Pi_{SD}$ is compact and metrizable. Let us consider the following assumptions.

(A5.7) For each $f \in \Pi_{SD}$ and $i \in S$, there exists a measure $\eta_i(f) \in \mathcal{P}(S)$ such that

-Pn(/)(ui)- 

(A5.8) $f \mapsto \eta_i(f)$ is continuous for any $i \in S$.

(A5.9) For each $i \in S$, $\{\eta_i(f) : f \in \Pi_{SD}\}$ is tight (for a definition of tightness, see [130, Def. 3.1, p. 28]).

(A5.10) For each $f \in \Pi_{SD}$, the corresponding process is recurrent.

(A5.11) For each $f \in \Pi_{SD}$, the corresponding process does not have disjoint invariant sets.

(A5.12) $\{P(f)(i, \cdot) : i \in S,\ f \in \Pi_{SD}\}$ is tight.

It is easy to see that (A5.7)-(A5.8) imply that for each $i \in S$, $\{\eta_i(f) : f \in \Pi_{SD}\}$ is compact. Hence, in particular, (A5.7)-(A5.8) $\Rightarrow$ (A5.9). By definition, (A5.9) $\Rightarrow$ (A5.7). Also, it can easily be shown that (A5.9) and (A5.11) $\Rightarrow$ (A5.8), and (A5.12) $\Rightarrow$ (A5.9). However, (A5.12) may be easier to verify.

Theorem 5.7. Each of the following five combinations of assumptions is sufficient for the existence of an average optimal $f \in \Pi_{SD}$: (A5.7, A5.8), (A5.9, A5.10), (A5.9, A5.11), (A5.10, A5.12), (A5.11, A5.12).

Remark 5.6. The main idea behind the proof of this theorem can be traced back to the proof of Theorem 4.3. We give the main points and skip the details. Let $\beta_n \in (0,1)$ be a sequence such that $\beta_n \uparrow 1$, let $f_{\beta_n} \in \Pi_{SD}$ be $\beta_n$-discount optimal, and let $f_\infty$ be a limit point of $\{f_{\beta_n}\}$ in $\Pi_{SD}$. Suppose that $\rho^*(i)$ is a scalar satisfying $(1-\beta_n) J_{\beta_n}(i) \to \rho^*(i)$, for each $i \in S$ (along a suitable subsequence). Then by using Tauberian and ergodic theorems, one deduces that the optimal average cost starting from $i$ equals $\rho^*(i)$ and that $f_\infty$ is average optimal under (A5.7, A5.8). Under (A5.9, A5.10), $f_\infty$ is average optimal for initial states $i \in \hat S := \bigcup_{i} \mathrm{supp}(\eta_i(f_\infty))$, where ‘supp’ denotes the support. Then by (A5.10) there exists an $f$ such that the corresponding process starting from any $i \in S \setminus \hat S$ reaches $\hat S$. Set

$$
\bar f(i) =
\begin{cases}
f(i) & \text{if } i \notin \hat S, \\
f_\infty(i) & \text{if } i \in \hat S.
\end{cases}
$$

It follows that $\bar f$ is average optimal. The other cases can be dealt with in a similar manner.

5.2. Unbounded costs. We have thus far considered bounded costs only. There are practical situations (e.g. in queueing systems) where the cost is typically unbounded. We assume that $c \ge 0$ (cf. (A2.1)). Let us now consider the ACOE for unbounded $c$. Note that the boundedness of $c$ did not play any role in the proof of Theorem 5.1. For unbounded $c$ the ACOE is unlikely to admit a bounded solution.

Lippman [112], [113] has studied controlled semi-Markov processes with unbounded costs. He has placed polynomial bounds on the movement of the process in one transition. He has made a further assumption that there exists an $f \in \Pi_{SD}$ such that both the mean first passage times and mean first passage costs from any state $i$ to state 0 under the policy are finite. Moreover, if $f \in \Pi_{SD}$ is close to $\beta$-discount optimal for a sequence of discount factors, then it is AC-optimal. Lippman has employed the vanishing discount approach of Taylor and Ross to establish the existence of a solution $(\rho, h)$ to the ACOE with $h$ satisfying (5.2), thereby establishing the existence of an average optimal $f \in \Pi_{SD}$. He has also given some examples from queueing systems where his conditions are satisfied. However, his condition on the $\beta$-discounted value function appears to be very difficult to verify.

Hordijk [89] has used a Lyapunov stability condition to establish the existence of an average optimal $f \in \Pi_{SD}$.

(A5.13) (Lyapunov Condition) Let

$$
\tilde P(f)(i,j) =
\begin{cases}
P(f)(i,j), & j \ne 0, \\
0, & j = 0.
\end{cases}
$$

There exists a function $w : S \to \mathbb{R}_+$ such that, for all $i \in S$,

(i) $c(i, f(i)) + 1 + \sum_j \tilde P(f)(i,j)\, w(j) \le w(i)$ for all $f \in \Pi_{SD}$.

(ii) $\sum_j \tilde P(f)(i,j)\, w(j)$ is continuous in $f$.

(iii) $\lim_{n \to \infty} \sum_j \tilde P^n(f)(i,j)\, w(j) = 0$.

Theorem 5.8. Under the above Lyapunov condition there exists an AC-optimal $f \in \Pi_{SD}$.
Proof (Sketch). Let $f \in \Pi_{SD}$. For $i \in S$, we define

$$\tau_i = \min\{t \ge 1 : X_t = i\},$$

where $X_t$ is governed by $f$. Then under (A5.13), using the standard techniques of the stochastic Lyapunov function method [105], [89], the following results can be proved:

$$E_i^f[\tau_0] \le w(i), \tag{5.13}$$

$$E_i^f\Big[\sum_{t=0}^{\tau_0 - 1} c(X_t, f(X_t))\Big] \le w(i). \tag{5.14}$$

Indeed, with $n \in \mathbb{N}$ and $n \ge 1$,

$$E_i^f\big[w(X_{n \wedge \tau_0})\big] - w(i) = E_i^f\Big[\sum_{t=0}^{n \wedge \tau_0 - 1} \big( E^f[w(X_{t+1}) \mid X_t] - w(X_t) \big)\Big] \le -E_i^f[n \wedge \tau_0],$$

where the last inequality is due to (A5.13). Hence,

$$E_i^f[n \wedge \tau_0] \le w(i),$$

and letting $n \uparrow \infty$, (5.13) follows. Also, (5.14) can be proved along the same lines. By an ergodic theorem [129]

$$\lim_{N \to \infty} \frac{1}{N} E_i^f\Big[\sum_{t=0}^{N-1} c(X_t, f(X_t))\Big] = \big(E_0^f \tau_0\big)^{-1} E_0^f\Big[\sum_{t=0}^{\tau_0 - 1} c(X_t, f(X_t))\Big] =: \rho(f).$$

Let

$$\rho^* = \inf_{f \in \Pi_{SD}} \rho(f).$$

Then $\rho^* \le w(0)$. Define

$$h(i) = \inf_{f \in \Pi_{SD}} E_i^f\Big[\sum_{t=0}^{\tau_0 - 1} \big(c(X_t, f(X_t)) - \rho^*\big)\Big].$$

Then $h(0) = 0$. Using (5.13), (5.14) and (A5.13, iii), it can be shown that $(\rho^*, h)$ is a solution of the ACOE with $h$ satisfying (5.2), and the desired result follows. □
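The estimates (5.13) and (5.14) can be checked directly on a chain for which a Lyapunov function is available. The sketch below is a hedged illustration under assumptions of my own: a reflected random walk on $\{0, \dots, N\}$ under one fixed policy, linear cost $c(i) = i$, and the hand-picked quadratic candidate $w(i) = 1.25\,i^2 + 6i + 5$. It verifies the drift inequality (A5.13, i) with the taboo kernel and then compares the mean return time and mean return cost to state 0 with $w$.

```python
# Reflected random walk on {0, ..., N} under one fixed policy (illustrative assumption):
# up with probability p, down with probability q = 1 - p; at 0 "down" means staying at 0,
# at N "up" means staying at N.
N, p, q = 20, 0.3, 0.7
states = range(N + 1)
P = [[0.0] * (N + 1) for _ in states]
for i in states:
    P[i][min(i + 1, N)] += p
    P[i][max(i - 1, 0)] += q

cost = [float(i) for i in states]                       # c(i, f(i)) = i
w = [1.25 * i * i + 6.0 * i + 5.0 for i in states]      # candidate Lyapunov function

# Taboo kernel of (A5.13): transitions into state 0 are removed.
P_taboo = [[0.0 if j == 0 else P[i][j] for j in states] for i in states]

# Drift condition (A5.13, i): c + 1 + sum_j P_taboo(i, j) w(j) <= w(i).
drift_ok = all(
    cost[i] + 1.0 + sum(P_taboo[i][j] * w[j] for j in states) <= w[i] + 1e-9
    for i in states
)


def taboo_potential(reward, sweeps=2000):
    """Minimal nonnegative solution of u = reward + P_taboo u, by successive approximation."""
    u = [0.0] * (N + 1)
    for _ in range(sweeps):
        u = [reward[i] + sum(P_taboo[i][j] * u[j] for j in states) for i in states]
    return u


mean_return_time = taboo_potential([1.0] * (N + 1))     # E_i[tau_0], cf. (5.13)
mean_return_cost = taboo_potential(cost)                # E_i[sum_{t < tau_0} c(X_t)], cf. (5.14)
print(drift_ok, mean_return_time[0], mean_return_cost[N], w[N])
```

The quadratic growth of $w$ is what allows a linearly growing cost here, in line with the growth restriction discussed in Remark 5.7.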

Remark 5.7.

(a) Note that by (A5.13, i) the cost function $c$ does not grow faster than the Lyapunov function $w$. Thus, there is a restriction on the growth of $c$ imposed by $w$. In CMP $w(i) = i$, $w(i) = i^2$ are typical examples of Lyapunov functions [89]. In the latter case, for example, we can treat only those unbounded cost functions which do not grow faster than quadratic functions.

(b) The condition (A5.13, iii) is crucial in showing that the cost potential $h$ satisfies $\lim_{t \to \infty} \frac{1}{t} E_i^f[h(X_t)] = 0$, for all $f \in \Pi_{SD}$ and $i \in S$.

Federgruen, Hordijk and Tijms [52] have extended Hordijk’s results by replacing the single attracting point $\{0\}$ by a finite set $K \subset S$. Their main assumption is: There exists a finite set $K \subset S$ such that for each initial state $i \in S$ the suprema of the mean hitting time of $K$ and of the mean hitting costs are finite. This in turn is equivalent to the existence of a Lyapunov function $w : S \to \mathbb{R}_+$ satisfying (A5.13, i) where now $\tilde P$ is defined as:

$$
\tilde P(f)(i,j) =
\begin{cases}
P(f)(i,j), & j \notin K,\ f \in \Pi_{SD}, \\
0, & j \in K.
\end{cases}
$$

Under the additional assumptions that (A5.13, ii) and (iii) hold, and the ‘communication condition’ that for any $f \in \Pi_{SD}$ the corresponding process has no two disjoint invariant sets, they have established the existence of a solution $(\rho, h)$ to the ACOE by employing the vanishing discount approach and have shown that $h$ satisfies (5.2). This work has been further extended by Federgruen, Schweitzer and Tijms [53]. They have dropped the unichainedness assumption in [52]. Instead, they assume that any state can be reached from any other state via some policy. Under this and other conditions in [52] they have established the existence of a solution $(\rho, h)$ to the ACOE, with $h$ satisfying (5.2). They have deviated from the vanishing discount approach and have, instead, utilized Tychonoff’s fixed point theorem in their analysis. We again note that in all these investigations, a restrictive growth condition on the cost function is imposed, as noted in Remark 5.7.

The  Lyapunov  stability  condition  necessarily  imposes  a  blanket  stability  (i.e.,  positive 
recurrence)  of  certain  states  (cf.  (5.13))  which  may  be  very  restrictive.  On  the  other  hand 
(5.2)  is  not  easy  to  verify  in  general  and,  indeed,  may  not  hold  in  the  case  of  many  queueing 
models  [137].  Another  generalization  of  the  boundedness  of  the  solution  of  the  ACOE  could 
be  boundedness  from  below.  This  will  be  the  case  if  the  cost  function  has  some  ‘monotone’ 
properties,  which  naturally  arise  in  various  queueing  models.  This  line  of  thought  has  been 
pursued  in  various  ways  in  [24],  [28],  [72],  [74],  [75],  [137],  [138],  [151],  [152],  [168],  [186]. 

Sennott [151], [152] has prescribed very general conditions in this direction. We will now briefly describe them. Consider the following assumptions:

(A5.14) For every $i \in S$ and every $\beta \in (0,1)$, $J_\beta(i) < \infty$.

(A5.15) There exists a nonnegative integer $L$ such that

$$h_\beta(i) \ge -L.$$

(A5.16) There exists a function $M : S \to \mathbb{R}_+$ such that $h_\beta(i) \le M(i)$ for all $i \in S$ and any $\beta \in (0,1)$. For every $i \in S$ there exists an $a(i) \in U(i)$ such that

$$\sum_{j} P(j \mid i, a(i))\, M(j) < \infty.$$

Theorem 5.9. Under (A5.14)-(A5.16), there exists an AC-optimal $f \in \Pi_{SD}$.

Proof. Let $\beta_n \in (0,1)$ be such that $\beta_n \uparrow 1$. Let $f_{\beta_n}$ be $\beta_n$-discount optimal. Let $f$ be a limit point of $f_{\beta_n}$ as $n \to \infty$. In order to simplify the notation all subsequences of $\beta_n$ will also be denoted by $\beta_n$. By (A5.16), and a diagonal argument, there exists a function $h : S \to \mathbb{R}$ such that $\lim_{n \to \infty} h_{\beta_n}(\cdot) = h(\cdot)$. By (A5.15), $h(\cdot) \ge -L$. Let $\rho : S \to \mathbb{R}_+$ be a function such that $\lim_{n \to \infty} (1-\beta_n) J_{\beta_n}(i) = \rho(i)$. Using (A5.16) it is easy to see that $\rho(i) = \rho^*$, a constant. Now for $i \in S$,

$$(1-\beta_n) J_{\beta_n}(0) + h_{\beta_n}(i) = c(i, f_{\beta_n}(i)) + \beta_n \sum_{j \in S} P(j \mid i, f_{\beta_n}(i))\, h_{\beta_n}(j). \tag{5.15}$$

Fix an $i \in S$. Add $L$ to both sides to make $(h_{\beta_n}(i) + L) \ge 0$ and take the ‘liminf’ on both sides of (5.15). Then by Fatou’s lemma and the assumption of continuity of $P(j \mid i, \cdot)$ we conclude that

$$\rho^* + h(i) \ge c(i, f(i)) + \sum_{j} P(j \mid i, f(i))\, h(j).$$

Since $h(\cdot)$ is bounded below, the proof of Theorem 5.1 can be modified to show that $J(i,f) \le \rho^*$. By Theorem A.2, $J(i,\pi) \ge \rho^*$ for any $\pi \in \Pi$. Hence, $J(i,f) = \rho^*$, which is the optimal average cost, and $f$ is AC-optimal. □

Remark 5.8.

(a) From the above proof it is clear that if $\rho$ is a scalar and $h : S \to \mathbb{R}$ is bounded below and

$$\rho + h(i) \ge \min_{a \in U(i)} \Big\{ c(i,a) + \sum_{j} P(j \mid i,a)\, h(j) \Big\}, \tag{5.16}$$

then $\rho$ is the optimal average cost, and any $f \in \Pi_{SD}$ selecting the minimum on the right-hand side of (5.16) is AC-optimal. In this case, we may replace the ACOE by an average cost optimality inequality (ACOI) viz. (5.16).

(b) If, for each $i \in S$, $U(i)$ is finite, then in the above proof $f_{\beta_n}(i) = f(i)$ for large $n$. Then we can write for large $n$

$$(1-\beta_n) J_{\beta_n}(0) + h_{\beta_n}(i) = c(i, f(i)) + \beta_n \sum_{j} P(j \mid i, f(i))\, h_{\beta_n}(j).$$

By Fatou’s lemma

$$\rho + h(i) \ge c(i, f(i)) + \sum_{j} P(j \mid i, f(i))\, h(j).$$

Consider the stronger assumption

(A5.17) Condition (A5.16) holds and $\sum_j P(j \mid i,a)\, M(j) < \infty$, for all $a \in A$ and $i \in S$.

Under (A5.17), using dominated convergence, it is easy to see that

$$\rho + h(i) = \min_{a \in U(i)} \Big\{ c(i,a) + \sum_{j} P(j \mid i,a)\, h(j) \Big\}$$

and we obtain the ACOE. If for each $i \in S$, there is a finite set $R_i \subset S$ such that $P(j \mid i, \cdot) = 0$ for $j \notin R_i$, then (A5.17) will obviously hold. Such a condition is satisfied for systems whose dynamics have a nearest-neighbour motion property [28].

(c) If there exists an $f \in \Pi_{SD}$, under which the process is ergodic, irreducible with an invariant measure $\eta(f) \in \mathcal{P}(S)$, and $\sum_i c(i, f(i))\, \eta(f)(i) < \infty$, then (A5.14) and (A5.16) hold. (A5.15) holds if $J_\beta(i)$ is increasing in $i$. Direct conditions implying (A5.14)-(A5.17) can be found in [28], [32], [34], [74], [75], [151], [152], [168], [186]. See also [137], [138].

(d) Let $f \in \Pi_{SD}$ be a policy which attains the minimum on the right-hand side of (5.16). Fix an $i \in S$. If the chain under $f$ is positive recurrent at $i$, then one can show that equality holds at $i$ in (5.16). However, the lack of positive recurrence at $i$ may lead to strict inequality in (5.16). Cavazos-Cadena [33] has exhibited an example to demonstrate this. He has further shown in his example [33] that (A5.14)-(A5.16) are satisfied, but the ACOE does not admit any solution.
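In the finite case the inequality (5.16) holds with equality, and a pair $(\rho, h)$ together with a minimizing selector can be computed by relative value iteration. The sketch below is an illustration only: relative value iteration is a standard technique that I substitute here for the vanishing discount argument of the text, and the two-state model and its numbers are assumptions.

```python
# Toy model (illustrative assumption): two states, two actions.
c = [[1.0, 3.0], [4.0, 2.0]]            # c[i][a]
P = [[[0.9, 0.1], [0.2, 0.8]],          # P[i][a][j]
     [[0.5, 0.5], [0.7, 0.3]]]


def bellman(h):
    """Return (T h)(i) = min_a {c(i,a) + sum_j P(j|i,a) h(j)} and a minimizing selector."""
    values, selector = [], []
    for i in (0, 1):
        q = [c[i][a] + sum(P[i][a][j] * h[j] for j in (0, 1)) for a in (0, 1)]
        values.append(min(q))
        selector.append(q.index(min(q)))
    return values, selector


# Relative value iteration: apply T, then renormalize so that h(0) = 0.
h = [0.0, 0.0]
for _ in range(500):
    Th, _ = bellman(h)
    h = [v - Th[0] for v in Th]

Th, f = bellman(h)
rho = Th[0] - h[0]                       # candidate optimal average cost
residual = max(abs(rho + h[i] - Th[i]) for i in (0, 1))
print(rho, h, f, residual)
```

For countable state spaces with unbounded costs no such finite computation is available in general, which is where conditions such as (A5.14)-(A5.17) come in.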
5.3. The convex analytic approach. We will now describe Borkar’s convex analytic approach for the average cost case [20]-[26]. The convex analytic approach to the AC-problem is a natural extension of the linear programming approach when the state/action spaces are no longer finite. In this approach one views the control problem as the problem of minimizing a linear functional on the convex set of ‘ergodic occupation measures’ to be defined shortly [20]-[26]. This approach can also be used to treat other standard cost criteria, but it may be more involved for treating cases such as the DC criterion. On the other hand, it is more flexible and powerful for certain other purposes, e.g. pathwise average cost, constrained optimization problem, etc. Since the techniques involved here are entirely different from what we have thus far followed, we will embark on a more detailed discussion.

By replacing each $U(i)$ with $\prod_k U(k)$ and $P(j \mid i, \cdot)$ by its composition with the projection $\prod_k U(k) \to U(i)$, we may and will assume that the $U(i)$’s are replicas of a fixed compact metric space $A$. We say that an $f \in \Pi_{SR}$ is stable if the corresponding process is positive recurrent. We will assume that under an $f \in \Pi_{SR}$ the process has $S$ as its single communicating class. (This can be relaxed in some cases; see [26] for a discussion on this.) Therefore, a stable $f$ will have a unique invariant measure $\eta(f) \in \mathcal{P}(S)$ satisfying

$$\eta(f) P(f) = \eta(f).$$


Let $\Pi_{SSR}$ denote the space of stable stationary policies. $\Pi_{SSD}$ is defined analogously. For an $f \in \Pi_{SSR}$ denote by $\hat\eta(f) \in \mathcal{P}(S \times A)$ the ‘ergodic occupation measure’ defined by

$$\int_{S \times A} g \, d\hat\eta(f) = \sum_{i \in S} \eta(f)(i) \int_A g(i,a)\, f(i)(da)$$

for $g \in C_b(S \times A)$. We will consider the sample path average cost optimality which is stronger than the usual AC-optimality. Let

$$I_R = \{\hat\eta(f) : f \in \Pi_{SSR}\}, \qquad I_D = \{\hat\eta(f) : f \in \Pi_{SSD}\}.$$

Note that $\hat\eta(f)$ can only be defined for an $f \in \Pi_{SSR}$. To consider optimality in $\Pi$ we will need to consider the following empirical processes. Let $\pi \in \Pi$, and let $(X_t, A_t)$ be the corresponding processes with initial law $\mu \in \mathcal{P}(S)$. Define the $\mathcal{P}(\bar S \times A)$-valued empirical process $\{\nu_t\}_{t \ge 1}$ by

$$\nu_t(C \times D) = \frac{1}{t} \sum_{s=0}^{t-1} I\{X_s \in C,\ A_s \in D\}, \quad t \ge 1, \tag{5.17}$$

for $C$, $D$ Borel in $\bar S$, $A$, respectively. Here $\bar S = S \cup \{\infty\}$ is the one point compactification of $S$. By abuse of notation, we may identify $\hat\eta(f)$ with the element of $\mathcal{P}(\bar S \times A)$ that restricts to it on $S \times A$. Since $\mathcal{P}(\bar S \times A)$ is compact, $\{\nu_t\}$, viewed as a sequence of $\mathcal{P}(\bar S \times A)$-valued random variables, converges to a sample path dependent compact limit set in $\mathcal{P}(\bar S \times A)$. We characterize this set in Lemma 5.1 below, the statement of which calls for some new notation. Note that any element $\nu \in \mathcal{P}(\bar S \times A)$ can be decomposed as

$$\nu(B) = \delta_\nu\, \nu'\big(B \cap (S \times A)\big) + (1 - \delta_\nu)\, \nu''\big(B \cap (\{\infty\} \times A)\big) \tag{5.18}$$

for $B$ Borel in $\bar S \times A$, where $\delta_\nu \in [0,1]$ is uniquely specified and $\nu' \in \mathcal{P}(S \times A)$ (respectively, $\nu'' \in \mathcal{P}(\{\infty\} \times A)$) is uniquely specified if $\delta_\nu > 0$ (respectively, $\delta_\nu < 1$). We may render $\nu'$, $\nu''$ unique at all times by imposing an arbitrary fixed choice thereof when $\delta_\nu = 0$, respectively, 1.
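For a stable stationary policy on a finite state space, the empirical process (5.17) can be simulated and compared with the ergodic occupation measure $\hat\eta(f)$. The following sketch makes several assumptions that are not in the text: a two-state, two-action model with made-up numbers, a hypothetical randomized policy `f`, a fixed random seed, and a fixed initial state. With finitely many states no mass can escape to $\infty$, so $\delta_\nu = 1$ for every limit point.

```python
import random

P = [[[0.9, 0.1], [0.2, 0.8]],          # P[i][a][j], illustrative transition law
     [[0.5, 0.5], [0.7, 0.3]]]
f = [[0.6, 0.4], [0.3, 0.7]]            # f[i][a]: randomized stationary policy (assumed)


def ergodic_occupation_measure():
    """eta_hat(f)(i, a) = eta(f)(i) * f(i)(a), with eta(f) in closed form for two states."""
    p01 = sum(f[0][a] * P[0][a][1] for a in (0, 1))
    p10 = sum(f[1][a] * P[1][a][0] for a in (0, 1))
    eta = [p10 / (p01 + p10), p01 / (p01 + p10)]
    return [[eta[i] * f[i][a] for a in (0, 1)] for i in (0, 1)]


def empirical_measure(t, seed=0):
    """nu_t({(i, a)}) = (1/t) * #{s < t : X_s = i, A_s = a} along one simulated path."""
    rng = random.Random(seed)
    counts = [[0, 0], [0, 0]]
    x = 0                                # initial state (arbitrary choice)
    for _ in range(t):
        a = 0 if rng.random() < f[x][0] else 1
        counts[x][a] += 1
        x = 0 if rng.random() < P[x][a][0] else 1
    return [[counts[i][a] / t for a in (0, 1)] for i in (0, 1)]


target = ergodic_occupation_measure()
nu = empirical_measure(200_000)
gap = max(abs(nu[i][a] - target[i][a]) for i in (0, 1) for a in (0, 1))
print(target, nu, gap)
```

The sample path average cost up to time $t$ is the integral of $c$ against $\nu_t$, which is why the limit points of $\{\nu_t\}$ determine pathwise optimality.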

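Lemma 5.1. Outside a set of zero probability (with respect to the law of the process), the following holds: For any limit point $\nu$ of $\{\nu_t\}$ in $\mathcal{P}(\bar S \times A)$ for which $\delta_\nu > 0$,

$$\nu' = \hat\eta(f)$$

for some $f \in \Pi_{SSR}$.

Proof. By the martingale stability theorem [114, p. 53]

$$
\begin{aligned}
&\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \Big( I\{X_s = i\} - E\big[ I\{X_s = i\} \mid \mathcal{F}_{s-1} \big] \Big) \\
&\quad = \lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \Big( I\{X_s = i\} - \sum_{j \in S} P(i \mid j, A_{s-1})\, I\{X_{s-1} = j\} \Big) \\
&\quad = \lim_{t \to \infty} \Big( \nu_t(\{i\} \times A) - \int P(i \mid \cdot\,, \cdot)\, d\nu_t \Big) \\
&\quad = 0, \quad \text{a.s.,}
\end{aligned}
$$

for each $i \in S$. Consider a sample path outside the set of zero probability on which the above fails for any $i \in S$. Then for any $\nu$ as in the statement of the lemma, we must have

$$\nu'(\{i\} \times A) \ge \int P(i \mid \cdot\,, \cdot)\, d\nu', \quad i \in S.$$

Note that an inequality is obtained here, since the second term on the right hand side of (5.18) is obviously nonnegative. Summing over $i \in S$ on both sides, it follows that equality must hold. Decomposing $\nu'$ as $\nu'(i, da) = \bar\nu(i) f(i)(da)$, where $\bar\nu \in \mathcal{P}(S)$ is the marginal on $S$ and $i \mapsto f(i) \in \mathcal{P}(A)$ is a version of the regular conditional law which defines an element of $\Pi_{SR}$, we obtain

$$\bar\nu(i) = \sum_{j \in S} \bar\nu(j)\, P(f)(j,i).$$

Hence, $\bar\nu = \eta(f)$ and the conclusion follows. □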

Lemma 5.2. The sets $I_R$ and $I_D$ are closed; also, $I_R$ is convex and has its extreme points in $I_D$.

Proof. Let $\hat\eta(f_n) \in I_R$ and $\hat\eta(f_n) \to \nu$ for some $\nu$ in $\mathcal{P}(S \times A)$. Then for all $i \in S$

$$\hat\eta(f_n)(\{i\} \times A) = \int P(i \mid \cdot\,, \cdot)\, d\hat\eta(f_n), \quad n \ge 1.$$

Letting $n \to \infty$

$$\nu(\{i\} \times A) = \int P(i \mid \cdot\,, \cdot)\, d\nu.$$

Now argue as in the proof of the preceding lemma to conclude that $\nu = \hat\eta(f)$ for some $f \in \Pi_{SSR}$. This proves that $I_R$ is closed. The proof that $I_D$ is closed is similar. Let $f_1, f_2 \in \Pi_{SSR}$ and $0 < \lambda < 1$. Define $f \in \Pi_{SSR}$ as follows:

$$f(i) = \frac{\lambda\, \eta(f_1)(i)\, f_1(i) + (1-\lambda)\, \eta(f_2)(i)\, f_2(i)}{\lambda\, \eta(f_1)(i) + (1-\lambda)\, \eta(f_2)(i)}.$$

Then using the properties of invariant measures, it is not difficult to see that

$$\eta(f) = \lambda\, \eta(f_1) + (1-\lambda)\, \eta(f_2),$$

$$\hat\eta(f) = \lambda\, \hat\eta(f_1) + (1-\lambda)\, \hat\eta(f_2),$$

showing that $I_R$ is convex. Now let $f \in \Pi_{SSR}$ be such that for some $i_0 \in S$ and $0 < \lambda < 1$, there exist $\phi_1, \phi_2 \in \mathcal{P}(A)$ such that

$$\int P(\cdot \mid i_0, a)\, f(i_0)(da) = \lambda \int P(\cdot \mid i_0, a)\, \phi_1(da) + (1-\lambda) \int P(\cdot \mid i_0, a)\, \phi_2(da),$$

$$\int P(\cdot \mid i_0, a)\, \phi_1(da) \ne \int P(\cdot \mid i_0, a)\, \phi_2(da).$$

Define $f_1, f_2 \in \Pi_{SR}$ as

$$
f_k(j) =
\begin{cases}
f(j), & j \ne i_0, \\
\phi_k, & j = i_0,
\end{cases}
\qquad k = 1, 2.
$$

Then it can be shown [24] that $f_1, f_2 \in \Pi_{SSR}$, and any two of $\hat\eta(f)$, $\hat\eta(f_1)$, $\hat\eta(f_2)$ are distinct from each other. Let $b \in (0,1)$ be such that

$$\lambda = \frac{b\, \eta(f_1)(i_0)}{b\, \eta(f_1)(i_0) + (1-b)\, \eta(f_2)(i_0)}.$$

Then we can argue as before to conclude that

$$\hat\eta(f) = b\, \hat\eta(f_1) + (1-b)\, \hat\eta(f_2).$$

Therefore $\hat\eta(f)$ is not an extreme point of $I_R$. This implies that, for $\hat\eta(f')$ to be an extreme point of $I_R$, $P(\cdot \mid i,a)$ must be constant over $a \in \mathrm{supp}(f'(i))$, for each $i \in S$. Hence, $P(f'') = P(f')$, for all $f'' \in \Pi_{SSR}$ such that $\mathrm{supp}(f''(i)) \subset \mathrm{supp}(f'(i))$, for each $i \in S$. In this case, $\eta(f'') = \eta(f')$. Suppose that for some $i$, say $i = 1$, there exist $a \in (0,1)$ and $\phi_1, \phi_2 \in \mathcal{P}(A)$, $\phi_1 \ne \phi_2$, such that $f'(1) = a\phi_1 + (1-a)\phi_2$. Define $f'_1, f'_2 \in \Pi_{SSR}$ by

$$
f'_k(i) =
\begin{cases}
\phi_k & \text{if } i = 1, \\
f'(i) & \text{otherwise,}
\end{cases}
\qquad k = 1, 2.
$$

It follows that $\eta(f') = \eta(f'_1) = \eta(f'_2)$. It is also easy to check that

$$\hat\eta(f') = a\, \hat\eta(f'_1) + (1-a)\, \hat\eta(f'_2),$$

which contradicts the extremality of $\hat\eta(f')$. Hence, $f'(1)$ must be a Dirac measure. Applying this argument to each $i \in S$, we deduce that $f' \in \Pi_{SD}$. From this it follows that the extreme points of $I_R$ lie in $I_D$. □
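The convexity step in the proof of Lemma 5.2 can be verified numerically: mixing two policies with state-dependent weights built from their invariant measures produces exactly the convex combination of their ergodic occupation measures. The sketch below is an illustration under assumed data (a two-state, two-action model, two made-up randomized policies `f1`, `f2`, and weight `lam`); the function names are hypothetical.

```python
P = [[[0.9, 0.1], [0.2, 0.8]],          # P[i][a][j], illustrative transition law
     [[0.5, 0.5], [0.7, 0.3]]]


def invariant_law(f):
    """Invariant measure eta(f) of the two-state chain induced by randomized policy f."""
    p01 = sum(f[0][a] * P[0][a][1] for a in (0, 1))
    p10 = sum(f[1][a] * P[1][a][0] for a in (0, 1))
    return [p10 / (p01 + p10), p01 / (p01 + p10)]


def occupation(f):
    """Ergodic occupation measure eta_hat(f)(i, a) = eta(f)(i) * f(i)(a)."""
    eta = invariant_law(f)
    return [[eta[i] * f[i][a] for a in (0, 1)] for i in (0, 1)]


f1 = [[0.6, 0.4], [0.3, 0.7]]           # f[i][a]; both policies are assumptions
f2 = [[0.1, 0.9], [0.8, 0.2]]
lam = 0.35
eta1, eta2 = invariant_law(f1), invariant_law(f2)

# Mixture policy from the proof: weights proportional to lam*eta1(i) and (1-lam)*eta2(i).
f = []
for i in (0, 1):
    w1, w2 = lam * eta1[i], (1 - lam) * eta2[i]
    f.append([(w1 * f1[i][a] + w2 * f2[i][a]) / (w1 + w2) for a in (0, 1)])

lhs = occupation(f)
occ1, occ2 = occupation(f1), occupation(f2)
rhs = [[lam * occ1[i][a] + (1 - lam) * occ2[i][a] for a in (0, 1)] for i in (0, 1)]
gap = max(abs(lhs[i][a] - rhs[i][a]) for i in (0, 1) for a in (0, 1))
print(lhs, rhs, gap)
```

Note that the mixing weights depend on the state through $\eta(f_1)(i)$ and $\eta(f_2)(i)$; a naive mixture $\lambda f_1(i) + (1-\lambda) f_2(i)$ would not, in general, reproduce the convex combination of occupation measures.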

We now proceed to show the existence of a sample path average cost optimal $f \in \Pi_{SSD}$. It is clear that a blanket stability condition or some condition on the cost that penalizes unstable behavior is required to give the desired existence. For example, consider the case $c(i,a) = \exp(-i)$ which rewards unstable behavior. Then the cost for any $f \in \Pi_{SSD}$ is a.s. positive while the cost for an unstable $f \in \Pi_{SD}$ is a.s. zero, making the latter optimal. We want to rule out this possibility, as stability is a very desirable property of a policy. We seek to find conditions under which our goal will be achieved. Let $f \in \Pi_{SSR}$. Define

$$\rho(f) := \int c \, d\hat\eta(f), \qquad \rho^* := \inf_{f \in \Pi_{SSR}} \rho(f).$$

Note that under $f \in \Pi_{SSR}$, the average cost equals $\rho(f)$ for each initial state $i \in S$. We consider two sets of hypotheses:

(A5.18) The Near-Monotonicity Condition:

$$\liminf_{i \to \infty}\ \min_{a \in A} c(i,a) > \rho^*.$$

Intuitively (A5.18) penalizes the drift of the process away from some finite set, requiring the optimal policy to exert some kind of a ‘centripetal force’ pushing the process back towards this finite set. Thus, the optimal policy gains the desired stability property. If $c(i,a) = k(i)$ for some $k : S \to \mathbb{R}_+$ and $k(i)$ is increasing, then this condition will automatically be satisfied. Such penalizing conditions quite often occur in queueing applications (see [20], [151], [152], [168], [186]).

(A5.19) Stability Condition (cf. (A5.7)-(A5.12)): $\Pi_{SR} = \Pi_{SSR}$ and $I_R$ is compact.

(A5.19') Equivalent conditions to (A5.19) are:

(i) $\Pi_{SD} = \Pi_{SSD}$ and $I_D$ is compact.

(ii) The mean return times to a prescribed state (say 0) are uniformly integrable over all $f \in \Pi_{SR}$.

(iii) Same as (ii) but with $\Pi_{SD}$ replacing $\Pi_{SR}$.

Theorem 5.10. Under (A5.18) or (A5.19) there exists an $f \in \Pi_{SSD}$ which is sample path average cost optimal in $\Pi_{SR}$.

Proof. From Lemma 5.2 it can be shown by an application of Choquet’s theorem [25], [26], that if $\nu \mapsto \int c \, d\nu$ attains its minimum on $I_R$, it will do so for an $f \in \Pi_{SD}$. Under (A5.19) it can be shown that $f \mapsto \hat\eta(f)$ is continuous. Therefore, the desired result follows under (A5.19). We next consider the case under (A5.18). Let $f_n \in \Pi_{SSR}$ be such that $\rho(f_n) \downarrow \rho^*$. By identifying $\hat\eta(f_n)$ with the element of $\mathcal{P}(\bar S \times A)$ that restricts to it on $S \times A$ for each $n$ and then dropping to a subsequence if necessary, we may assume that $\hat\eta(f_n) \to \nu$ in $\mathcal{P}(\bar S \times A)$ for some $\nu$. Let $n \to \infty$ in the equation

$$\hat\eta(f_n)(\{j\} \times A) = \int P(j \mid \cdot\,, \cdot)\, d\hat\eta(f_n), \quad j \in S,$$

and argue as in Lemma 5.1 to conclude that for $\nu'$ as in (5.18), $\delta_\nu > 0$ implies

$$\nu'(\{j\} \times A) = \int P(j \mid \cdot\,, \cdot)\, d\nu', \quad j \in S.$$

Decomposing $\nu'$ as $\nu'(i, da) = \bar\nu(i) f(i)(da)$, $i \in S$, we have $\bar\nu = \eta(f)$ and therefore, $\nu' = \hat\eta(f)$. Let $c_m = c \wedge m$ for $m \ge 1$ and pick $\varepsilon > 0$ such that (A5.18) continues to hold with $\rho^* + \varepsilon$ in place of $\rho^*$. Then

$$\rho^* \ge \lim_{n \to \infty} \int c_m \, d\hat\eta(f_n) \ge \delta_\nu \int c_m \, d\hat\eta(f) + (1 - \delta_\nu)\big((\rho^* + \varepsilon) \wedge m\big).$$

Letting $m \to \infty$,

$$\rho^* \ge \delta_\nu\, \rho^* + (1 - \delta_\nu)(\rho^* + \varepsilon).$$

This is possible only if $\delta_\nu = 1$ and $\int c \, d\hat\eta(f) = \rho^*$. □

The above theorem, however, does not ensure optimality of the cost-minimizing policy
in $\Pi_{SR}$ with respect to arbitrary policies. For the near-monotone case this can be resolved
without any further assumptions, but for the stable case, we need the following.

(A5.20) If $\tau = \min\{t \ge 1 : X_t = 0\}$, then

$$\sup_{\pi \in \Pi} E_0^\pi[\tau^2] < \infty.$$

Remark  5.9.  (A5.20)  clearly  implies  (A5.19).  The  converse  need  not  be  true,  as  can  be 
shown  by  an  explicit  example  [24].  Some  sufficient  conditions  for  (A5.20)  are: 

(i) a Lyapunov condition [28], which we describe shortly (cf. Theorem 5.11),

(ii)  the  strong  uniform  recurrence  condition  of  Doeblin  and  its  variants  [174],  and 

(iii) the condition that there exist an $N < \infty$ for which

$$\sup_{\pi \in \Pi} \sup_{i \in S} P_i^\pi(\tau > N) < 1,$$

where $\tau$ is as above.

Theorem 5.11. Under (A5.18) or (A5.20) there exists an $f \in \Pi_{SD}$ which is sample path
average cost optimal.

Proof. Under (A5.20) it can be shown [26] that the processes $\nu_t$ as defined in (5.17) are tight over $\Pi$. Therefore, $\delta_\nu$ as in the statement of Lemma 5.1 may be taken to be one. This resolves the case under (A5.20). Under (A5.18), let $\nu$ be a limit point of $\{\nu_t\}$ in $\mathcal{P}(\bar{S} \times A)$ along some subsequence. Then as in the proof of Theorem 5.9 it can be shown that

$$\liminf_{t \to \infty} \int c \, d\nu_t \ge \rho^*. \tag{5.19}$$

Since this is true for any limit point $\nu$ of $\{\nu_t\}$ in $\mathcal{P}(\bar{S} \times A)$ and for all sample points outside a set of probability zero, the desired result follows in this case also. □

Remark  5.10.  Some  open  problems  arising  in  this  context  are: 

(i)  Can  (A5.20)  be  replaced  by  (A5.19)  while  retaining  the  desired  optimality? 

(ii) If $\Pi_{SR} = \Pi_{SSR}$, will (A5.19) hold automatically?

Remark  5.11.  The  condition  in  (5.19)  implies  a  much  stronger  optimality  which  will  be 
discussed  in  Section  6. 

Now after the existence result of Theorem 5.11, an alternative treatment of the ACOE is possible. We will present a brief description without proofs. For details, see [24], [26], [28]. Define $h : S \to \mathbb{R}$ by

$$h(i) = E_i^{f_0}\left[\sum_{t=0}^{\tau-1} \big(c(X_t, f_0(X_t)) - \rho^*\big)\right], \quad i \in S, \tag{5.20}$$

where $\tau = \min\{t \ge 1 : X_t = 0\}$ and $f_0 \in \Pi_{SD}$ is any sample path average cost optimal policy. In [22], [24], it is shown that $(h(\cdot), \rho^*)$ satisfies the ACOE under the following additional hypothesis called stability under local perturbations.

(A5.21) Given an $f \in \Pi_{SSD}$ with $\rho(f) < \infty$, any $f' \in \Pi_{SD}$ obtained from $f$ by changing the actions at at most finitely many states is also stable and $\rho(f') < \infty$.

A sufficient, though not necessary, condition for (A5.21) to hold is that every state has at most finitely many neighbors, i.e., for each $i \in S$, there is a finite set $R_i \subset S$ such that $P(j \mid i, \cdot) = 0$ for $j \notin R_i$.
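
To make (5.20) concrete, the following minimal sketch computes $h$ for a small finite chain under a fixed stationary policy and checks that it solves the Poisson equation $\rho + h = c + Ph$ with $h(0) = 0$. The 4-state transition matrix and cost vector are hypothetical illustrative numbers, not taken from the survey; the fixed policy simply plays the role of $f_0$.

```python
import numpy as np

# Hypothetical 4-state chain under a fixed stationary policy (illustrative numbers).
P = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.3, 0.2, 0.5, 0.0],
              [0.0, 0.4, 0.1, 0.5],
              [0.0, 0.0, 0.6, 0.4]])
c = np.array([0.0, 1.0, 2.0, 3.0])   # one-stage cost c(i, f0(i))
n = len(c)

# Invariant distribution: solve eta P = eta together with sum(eta) = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
eta, *_ = np.linalg.lstsq(A, b, rcond=None)
rho = eta @ c                         # average cost under the policy

# First-step analysis of (5.20): paths are stopped on their first return to state 0,
# so h(i) = (c(i) - rho) + sum_{j != 0} P(j | i) h(j).
Q = P.copy()
Q[:, 0] = 0.0
h = np.linalg.solve(np.eye(n) - Q, c - rho)
```

Here `h[0]` vanishes (up to rounding) because the expected cost over one cycle equals `rho` times the expected cycle length, and `rho + h` agrees with `c + P @ h`.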

In many cases, the solution $(\rho^*, h)$ of the ACOE can be characterized (Theorem 5.12 below). The usual characterization of AC-optimal $f \in \Pi_{SD}$ in terms of the ACOE can also be proved for the foregoing.

Theorem 5.12. Assume (A5.18) and let $f_0$, $h$ be defined as above (cf. (5.20)). Let

$$H = \{(\rho, w) : (\rho, w) \text{ satisfies the ACOE}, \ w(0) = 0, \ \inf w(\cdot) > -\infty\}.$$

Then $(\rho^*, h)$ is the unique element of $H$ corresponding to the minimum value of $\rho$ (i.e., if $(\rho', w')$ is another element of $H$, then $\rho' \ge \rho^*$ with equality if and only if $w' = h$).

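Now, instead of (A5.18), suppose $c$ is bounded and the following Lyapunov condition holds: There exists a $w : S \to \mathbb{R}_+$, a finite $A \subset S$ and an $\varepsilon > 0$ such that:

(a) $0 \in A$ and the set $\{i \in A^c : P(j \mid i, a) > 0 \text{ for some } j \in A, \ a \in A\}$ is finite.

(b) $\lim_{i \to \infty} w(i) = \infty$.

(c) Under any $\pi \in \Pi$, $\mu \in \mathcal{P}(S)$,

$$E_\mu^\pi\big[(w(X_{t+1}) - w(X_t) + \varepsilon) \, I\{X_t \notin A\} \mid \mathcal{F}_t\big] \le 0, \quad \text{a.s.}$$

(d) There exists a random variable $Z$ and a scalar $\lambda > 0$ such that $E[\exp(\lambda Z)] < \infty$ and for all $b > 0$

$$P_\mu^\pi\big(|w(X_{t+1}) - w(X_t)| > b \mid \mathcal{F}_t\big) \le P(Z > b).$$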
Then (ρ*, h) is the unique solution of the ACOE in the class {(ρ, w) : w(0) =
0,  limsup,-^  £$•  <  00} . 

Remark 5.12. An alternative 'intrinsic' formulation of the ACOE is also possible. For any $f \in \Pi_{SSD}$, define $h_f : S \to \mathbb{R}$ by

$$h_f(i) = E_i^f\left[\sum_{t=0}^{\tau-1} \big(c(X_t, f(X_t)) - \rho(f)\big)\right], \quad i \in S.$$

We say that $f$ is locally AC-optimal if it yields a lower cost than any other element of $\Pi_{SD}$ obtainable from $f$ by changing $f$ in at most finitely many states. In addition to the foregoing hypotheses, assume that every locally AC-optimal $f$ is AC-optimal (for bounded $c$, a sufficient condition for this is that $\Pi_{SD} = \Pi_{SSD}$ and $\{\eta(f) : f \in \Pi_{SSD}\}$ is tight). One then has that $f$ is sample path average cost optimal if and only if, for $i \in S$,

$$h_f(i) = \inf_{a} \Big[\sum_{j} P(j \mid i, a) h_f(j) + c(i,a) - \rho(f)\Big].$$

This statement is 'intrinsic' in the sense that all quantities (i.e., $h_f$, $\rho(f)$) are computable in terms of $f$. An interesting open problem is to characterize the most general conditions under which local AC-optimality implies AC-optimality.

Remark  5.13.  The  Lyapunov  condition  in  Theorem  5.12(ii)  implies  (A5.20)  and  has  many 
other  implications  [26],  but  the  condition  (ii)(d)  there  is  rather  strong  and  due  to  this  it 
may  be  difficult  to  construct  such  a  function  in  a  given  situation.  A  partial  answer  to  this 
question  is  given  in  [72].  It  would  be  interesting  to  investigate  if  the  Lyapunov  conditions 
studied  by  [89],  [53]  (cf.  (A5.13)),  which  do  not  involve  (ii)(d)  above,  imply  (A5.20). 

§6.  Borel  State  and  Action  Spaces 

We  consider  in  this  section  the  case  in  which  S  and  A  are  general  Borel  spaces.  This 
is  a  natural  setting  for  many  problems,  e.g.  control  of  stock  in  water  reservoirs,  allocation 
of a resource between production and consumption, control of biological populations, harvesting
a natural resource; see [17], [50], [80], for several examples, and references therein.
Also,  the  equivalent  formulation  of  POCMP  in  terms  of  the  conditional  distribution  of  the 
(unobservable)  state  leads  to  a  problem  with  an  uncountable  Borel  state  space,  as  we  are 
going  to  see  in  Section  7. 

In this more general context, the ACOE is written as

$$\rho(x) + h(x) = \inf_{a \in U(x)} \left\{ c(x,a) + \int_S h(y) P(dy \mid x,a) \right\} = T(h)(x), \quad x \in S, \tag{6.1}$$

where $\rho, h \in \mathcal{M}(S)$. As in Section 5, a pair of functions $(\rho, h)$ as above is called a solution to the ACOE, and if $\rho$ and $h$ are bounded, we will say that the solution is bounded. Also, as in Theorem 5.1, our aim is to relate the AC problem to the existence of solutions to the ACOE. We have the following.
Theorem 6.1. Suppose that $(\rho, h)$ is a solution to the ACOE, and that for each policy $\pi \in \Pi_M$, the following holds

$$\lim_{t \to \infty} E_x^\pi\left[\frac{h(X_t)}{t}\right] = 0, \quad \forall x \in S. \tag{6.2}$$

Then we have that

(i) there holds

$$\limsup_{n \to \infty} \frac{1}{n+1} E_x^\pi\left[\sum_{t=0}^{n} \rho(X_t)\right] \le J(x, \pi), \tag{6.3}$$

and if $\pi \in \Pi_{SD}$ is such that $\pi(x)$ attains the infimum in (6.1), then equality is attained in (6.3);

(ii) if $\rho(x) = \rho^* \in \mathbb{R}$ for all $x \in S$, then $J^*(x) = \rho^*$, for all $x \in S$, and any $\pi^* \in \Pi_{SD}$ such that $\pi^*(x)$ attains the infimum in (6.1) is average optimal.

The proof of Theorem 6.1 follows that of Theorem 5.1 and is essentially contained in [173], and more explicitly in [78], [144]; see also [76, pp. 66-68], [80, pp. 53-55], [146, pp. 93-94]. Note that (i) above says that if $\rho(\cdot)$ is taken as the cost function to define another CMP $(S, A, U, P, \rho)$ then, for any $\pi \in \Pi_M$, the average cost incurred under the cost function $\rho(\cdot)$ does not exceed that under cost function $c(\cdot,\cdot)$.

Given the results above, it is of interest to find conditions under which there exists a solution $(\rho, h)$ to the ACOE, satisfying (6.2). If $h$ is bounded, then (6.2) is satisfied trivially. Also, if the random variables $\{h(X_t)\}$ are uniformly integrable under $P_x^\pi$, for $\pi \in \Pi_M$ and $x \in S$, then there exists a constant $0 < K_x^\pi < \infty$ such that $E_x^\pi[|h(X_t)|] \le K_x^\pi$. Hence, if such a uniform integrability condition holds under every $\pi \in \Pi_M$ and $x \in S$, then (6.2) is also satisfied trivially. The latter approach has been used by Shwartz and Makowski, for some queueing problems [162], [163], [164].

6.1.  Bounded  costs.  We  first  assume  that  c(-,-)  is  bounded.  When  there  are  bounded 
solutions  (p,h)  to  the  ACOE,  then  much  stronger  results  than  those  in  Theorem  6.1  (i)  can 
be  obtained.  To  state  these,  some  definitions  are  needed. 

Let $R$ and $H$ be bounded, measurable real-valued functions on $S$, i.e., $R, H \in \mathcal{M}_b(S)$, and let $\pi^* \in \Pi$. Following the terminology of Yushkevich and Dynkin [50], the triplet $(R, H, \pi^*)$ is said to be canonical if

$$J_N(x, \pi^*, H) = J_N^*(x, H) = H(x) + N R(x), \quad \forall N \in \mathbb{N}_0, \ x \in S, \tag{6.4}$$

and $\pi^* \in \Pi$ is said to be a canonical policy if it is an element of some canonical triplet. Note that if $(R, H, \pi^*)$ is a canonical triplet, then $\pi^*$ is $N$-stage optimal, for all $N \in \mathbb{N}_0$, when $H$ is taken as the terminal cost. This concept was introduced by Yushkevich [200]. For finite models Denardo and Fox [36] used a similar approach.

A policy $\pi^* \in \Pi$ is said to be strong average optimal if

$$\limsup_{N \to \infty} \frac{1}{N} J_N(x, \pi^*) \le \liminf_{N \to \infty} \frac{1}{N} J_N(x, \pi), \quad \forall x \in S, \ \forall \pi \in \Pi. \tag{6.5}$$

Alternate definitions of strong average optimality are given in [67], [68]. Clearly, a strong average optimal policy $\pi^*$ is also average optimal, and the limit of the sequence $\{\frac{1}{N} J_N(x, \pi^*)\}$, as $N \to \infty$, exists. An interpretation of (6.5) is that the 'most pessimistic' average performance under $\pi^*$ is no worse than the 'most optimistic' performance under any other policy. We have the following.

Theorem 6.2. Let $\pi^* \in \Pi_{SD}$, let $\rho, h \in \mathcal{M}_b(S)$ and $c \in \mathcal{M}_b(K)$. Then $(\rho, h, \pi^*)$ is a canonical triplet if and only if

$$\rho(x) = \inf_{a \in U(x)} \left\{ \int_S \rho(y) P(dy \mid x,a) \right\} \tag{6.6}$$

and

$$\rho(x) + h(x) = \inf_{a \in U(x)} \left\{ c(x,a) + \int_S h(y) P(dy \mid x,a) \right\} \tag{6.7}$$

and $\pi^*(x)$ attains the infimum in both (6.6) and (6.7), for all $x \in S$.

Proof. (Necessity) Let $(\rho, h, \pi^*)$ be a canonical triplet. Then, by (6.4)

$$\begin{aligned} h(x) + \rho(x) + N\rho(x) &= J_{N+1}^*(x, h) \\ &= T(J_N^*)(x) \\ &= c(x, \pi^*(x)) + \int_S J_N^*(y, h) P(dy \mid x, \pi^*(x)). \end{aligned} \tag{6.8}$$

Since $J_0(x, \pi^*, h) = J_0^*(x, h) = h(x)$, then (6.7) follows from (6.8), by letting $N = 0$. Furthermore, since $\rho(\cdot)$, $h(\cdot)$, and $c(\cdot,\cdot)$ are bounded, then dividing both sides of (6.8) by $N$ and letting $N \to \infty$, yields (6.6).

(Sufficiency) Let $(\rho, h)$ satisfy (6.6) and (6.7), and let $\pi^*(x)$ attain the infimum in these expressions. We use induction to show that $(\rho, h, \pi^*)$ is a canonical triplet. For $N = 0$, this is trivially satisfied. Suppose $N \in \mathbb{N}_0$ is the first integer for which (6.4) fails, then

$$\begin{aligned} J_N^*(x, h) &= T(J_{N-1}^*)(x) \\ &= T(h + (N-1)\rho)(x) \\ &= \inf_{a \in U(x)} \left\{ c(x,a) + \int_S h(y) P(dy \mid x,a) + (N-1) \int_S \rho(y) P(dy \mid x,a) \right\} \\ &\ge T(h)(x) + (N-1) \inf_{a \in U(x)} \left\{ \int_S \rho(y) P(dy \mid x,a) \right\} \\ &= T(h)(x) + (N-1)\rho(x) = h(x) + N\rho(x). \end{aligned}$$

On the other hand,

$$\begin{aligned} J_N^*(x, h) &\le J_N(x, \pi^*, h) \\ &= c(x, \pi^*(x)) + \int_S J_{N-1}(y, \pi^*, h) P(dy \mid x, \pi^*(x)) \\ &= c(x, \pi^*(x)) + \int_S [h(y) + (N-1)\rho(y)] P(dy \mid x, \pi^*(x)) \\ &= T(h)(x) + (N-1)\rho(x) = h(x) + N\rho(x), \end{aligned}$$

contradicting our hypothesis. Therefore, $(\rho, h, \pi^*)$ is a canonical triplet. □
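
The canonical identity (6.4) can be checked numerically on a finite model. The sketch below is only an illustration under stated assumptions: the 3-state, 2-action transition array `P` and cost array `c` are hypothetical, every row of `P` is strictly positive (so $\rho$ is constant and (6.6) holds trivially), and $(\rho, h, \pi^*)$ is obtained by Howard-style policy iteration, one of several possible solvers.

```python
import numpy as np

# Hypothetical 3-state, 2-action model (illustrative numbers only).
# P[a, x, y] = P(y | x, a), c[x, a] = one-stage cost.
P = np.array([[[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],
              [[0.1, 0.6, 0.3], [0.5, 0.1, 0.4], [0.6, 0.2, 0.2]]])
c = np.array([[1.0, 2.0], [0.5, 1.5], [3.0, 0.2]])
nS, nA = c.shape
idx = np.arange(nS)

def evaluate(f):
    """Solve rho + h = c_f + P_f h with the normalization h(0) = 0."""
    M = np.eye(nS) - P[f, idx]
    M[:, 0] = 1.0                      # column 0 now carries the unknown rho
    sol = np.linalg.solve(M, c[idx, f])
    return sol[0], np.append(0.0, sol[1:])

def q_values(u):
    """q(x, a) = c(x, a) + sum_y u(y) P(y | x, a)."""
    return c + np.einsum('axy,y->xa', P, u)

# Policy iteration; ties keep the current action so the loop terminates.
f = np.zeros(nS, dtype=int)
while True:
    rho, h = evaluate(f)
    q = q_values(h)
    better = q.min(axis=1) < q[idx, f] - 1e-12
    if not better.any():
        break
    f = np.where(better, q.argmin(axis=1), f)

# Check (6.4): finite-horizon dynamic programming with terminal cost h.
J_opt, J_pi = h.copy(), h.copy()
for N in range(1, 11):
    J_opt = q_values(J_opt).min(axis=1)   # J*_N(., h) = T(J*_{N-1})
    J_pi = q_values(J_pi)[idx, f]         # J_N(., pi*, h) under the selector f
    assert np.allclose(J_opt, h + N * rho)
    assert np.allclose(J_pi, h + N * rho)
```

On termination `f` is a selector for (6.7), and both the optimal $N$-stage cost and the $N$-stage cost of `f` equal `h + N * rho`, as the theorem asserts.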

The results in Theorem 6.2 were obtained by Yushkevich [200]; see also [50]. Note that (6.7) is the ACOE, and (6.6) allows $\rho(\cdot)$ to be treated as a constant, with respect to the optimization problem. Of course, if $\rho(x) = \rho^*$, for all $x \in S$, then (6.6) is satisfied trivially. The coupled equations (6.6) and (6.7) were apparently introduced by Howard [92, pp. 61-62], in the context of finite state CMP for which, under some policies, $\{X_t\}$ has several ergodic classes, i.e., the so-called multi-chain case. In this case, different ergodic classes may have different optimal average costs, and $\rho(\cdot)$ gives this cost, as will be shown.

From Theorem 6.2 we see that the canonical policy $\pi^*$ is a measurable selector for both (6.6) and (6.7). However, condition (A2.2) in Section 2 is not enough to guarantee the existence of selectors in either (6.6) or (6.7), since $\rho$ and $h$ are assumed to be bounded and measurable functions, but not necessarily lower semicontinuous. For this situation, the following condition is needed.

(A6.1) The transition kernel $P(\cdot \mid x,a)$ is strongly continuous in $(x,a)$; that is, $u \in \mathcal{M}_b(S)$ implies $\int_S u(y) P(dy \mid \cdot, \cdot) \in C_b(K)$.

It follows that under conditions (A2.1), (A2.3) and (A6.1), measurable selectors exist for each of (6.6) and (6.7), and $\pi^* \in \Pi_{SD}$ will be a canonical policy if and only if it is a selector for both (6.6) and (6.7). If $(\rho, h, \pi^*)$ is a canonical triplet, then $(\rho, h)$ solves the ACOE, and (6.2) is satisfied, since $h$ is bounded. Consequently, the results of Theorem 6.1 follow. The next result presents other important implications.

Theorem 6.3. Let $(\rho, h, \pi^*)$ be a canonical triplet, and let $c \in \mathcal{M}_b(K)$. Then, for each $x \in S$,

(i) $J_N(x, \pi^*) \le J_N(x, \pi) + \mathrm{span}(h)$, for every $\pi \in \Pi$;

(ii) $\pi^*$ is strong average optimal;

(iii) $J(x, \pi^*) = J^*(x) = \rho(x)$;

(iv) $h^-(x) + \frac{\rho(x)}{1-\beta} \le J_\beta(x) \le h^+(x) + \frac{\rho(x)}{1-\beta}$;

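(v) If $\rho(x) = \rho^* \in \mathbb{R}$, for all $x \in S$, then for every $\pi \in \Pi$

$$\limsup_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} c(X_t, A_t) \ge \rho^*, \quad P_x^\pi\text{-a.s.},$$

when $X_0 = x$, and $\{A_t\}$ is generated using the policy $\pi$. Furthermore

$$\lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} c(X_t, A_t) = \rho^*$$

if and only if

$$\lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} \Phi(X_t, A_t) = 0, \quad P_x^\pi\text{-a.s.},$$

where $\Phi : K \to \mathbb{R}$ is given by

$$\Phi(x,a) := c(x,a) + \int_S h(y) P(dy \mid x,a) - \rho^* - h(x).$$

(vi) $\pi^*$ is sample path average cost optimal.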
Proof. To prove (i), note that for all $\pi \in \Pi$

$$J_N(x, \pi^*, h) = E_x^{\pi^*}\left[\sum_{t=0}^{N-1} c(X_t, A_t) + h(X_N)\right] \le E_x^{\pi}\left[\sum_{t=0}^{N-1} c(X_t, A_t) + h(X_N)\right] = J_N(x, \pi, h).$$

Hence,

$$J_N(x, \pi^*) \le J_N(x, \pi) + E_x^{\pi}[h(X_N)] - E_x^{\pi^*}[h(X_N)] \le J_N(x, \pi) + \mathrm{span}(h), \quad \forall \pi \in \Pi.$$

By the boundedness of $h(\cdot)$, we have that

$$\lim_{N \to \infty} \frac{1}{N} J_N(x, \pi^*, h) = \lim_{N \to \infty} \frac{h(x) + N\rho(x)}{N} = \rho(x).$$

Furthermore, since $J_N(x, \pi^*, h) = J_N(x, \pi^*) + E_x^{\pi^*}[h(X_N)]$, then

$$\rho(x) = \lim_{N \to \infty} \frac{1}{N} J_N(x, \pi^*)$$

and (ii)-(iii) follow from (i).

Next, since $(\rho, h)$ solves the ACOE, $(\rho, h^-)$ and $(\rho, h^+)$ are also solutions to the ACOE. Since $h^-(\cdot) \le 0 \le h^+(\cdot)$, then by Lemma 2.1 we have that $T(h^-) \le T(\beta h^-) = T_\beta(h^-)$, and $T(h^+) \ge T(\beta h^+) = T_\beta(h^+)$. Then, (iv) follows by induction, using Theorem 2.1 (iv); see [62].

Turning our attention to (v) and (vi), observe that, due to (6.7), $\Phi(x,a) \ge 0$, for all $(x,a) \in K$. Also, by the (Markov) property (2.3) in Section 2, we have that, for any $\pi \in \Pi$,

$$\Phi(X_t, A_t) = E_x^\pi\big[c(X_t, A_t) + h(X_{t+1}) - \rho^* - h(X_t) \mid H_t, A_t\big], \quad P_x^\pi\text{-a.s.}$$

Let

$$Z_t := c(X_t, A_t) + h(X_{t+1}) - h(X_t) - \rho^* - \Phi(X_t, A_t),$$

and

$$M_N := \sum_{t=0}^{N-1} Z_t = \sum_{t=0}^{N-1} \big[c(X_t, A_t) - \rho^*\big] + h(X_N) - h(X_0) - \sum_{t=0}^{N-1} \Phi(X_t, A_t).$$

Note that $\{Z_t\}$ is a $(\mathcal{G}_t, P_x^\pi)$ martingale difference, where $\mathcal{G}_t := \sigma(H_{t+1}, A_{t+1})$. Since $\{Z_t\}$ is bounded uniformly in $t$, by the martingale stability theorem

$$\lim_{N \to \infty} \frac{M_N}{N} = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} Z_t = 0, \quad P_x^\pi\text{-a.s.}$$

Therefore, by the boundedness of $h(\cdot)$,

$$\lim_{N \to \infty} \left[\frac{1}{N} \sum_{t=0}^{N-1} c(X_t, A_t) - \rho^* - \frac{1}{N} \sum_{t=0}^{N-1} \Phi(X_t, A_t)\right] = 0, \quad P_x^\pi\text{-a.s.}$$

Finally, (v) and (vi) follow since $\Phi(x,a) \ge 0$, for all $(x,a) \in K$, and since for a canonical policy $\pi^*$, $\Phi(X_t, A_t) = 0$, $P_x^{\pi^*}$-a.s. □

The results in (i)-(iii) of Theorem 6.3 are essentially contained in [50, Chap. 7]; that in (iv) is motivated by similar results in [132] and [62]; (v) and (vi) are due to Georgin [70], see also [80, pp. 52-55]. Also, the function $\Phi$ defined in (v) was introduced by Mandl [120] and is often referred to as Mandl's discrepancy function.
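
As an illustration of the discrepancy function, the sketch below computes $\Phi$ for a small finite model and simulates a sample path under a non-optimal randomized policy. All numbers are hypothetical, and the ACOE solution is obtained here by relative value iteration with a fixed iteration count, which is an assumption of the sketch rather than part of the theory above. It checks that $\Phi \ge 0$, that $\min_a \Phi(x,a) = 0$, and that the sample average of $c - \rho^* - \Phi$ is close to zero.

```python
import numpy as np

# Hypothetical 3-state, 2-action model: P[a, x, y] = P(y | x, a), c[x, a] = cost.
P = np.array([[[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],
              [[0.1, 0.6, 0.3], [0.5, 0.1, 0.4], [0.6, 0.2, 0.2]]])
c = np.array([[1.0, 2.0], [0.5, 1.5], [3.0, 0.2]])
nS, nA = c.shape

def q_values(u):
    return c + np.einsum('axy,y->xa', P, u)

# Relative value iteration: bounded ACOE solution (rho_star, h) with h(0) = 0.
h = np.zeros(nS)
for _ in range(500):
    Th = q_values(h).min(axis=1)
    h = Th - Th[0]
rho_star = q_values(h).min(axis=1)[0]   # T(h)(0) - h(0), and h(0) = 0

# Mandl's discrepancy function Phi(x, a) = c + P h - rho_star - h.
Phi = q_values(h) - rho_star - h[:, None]

# Sample path under a uniformly randomizing (generally non-optimal) policy.
rng = np.random.default_rng(0)
N, x = 20000, 0
cost_sum = phi_sum = 0.0
for _ in range(N):
    a = rng.integers(nA)
    cost_sum += c[x, a]
    phi_sum += Phi[x, a]
    x = rng.choice(nS, p=P[a, x])

# Equals (M_N - h(X_N) + h(X_0)) / N, which tends to 0 almost surely.
gap = cost_sum / N - rho_star - phi_sum / N
```

The excess of the sample average cost over `rho_star` is therefore explained, up to a vanishing martingale term, by the average discrepancy along the path.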

In  view  of  Theorem  6.3,  it  follows  that  a  canonical  triplet  yields  the  desired  results.  We 
therefore  look  for  conditions  on  the  primary  objects  like  the  cost  function  c  and  transition 
kernel  P  which  imply  the  existence  of  a  canonical  triplet,  so  that  the  theory  can  be  used  in 
a  given  practical  situation.  To  this  end,  a  standard  procedure  is  to  assume  some  ergodicity 
conditions which will ensure the existence of a bounded solution to the ACOE. We have
already  discussed  several  such  conditions  for  the  countable  state  case  (cf.  (A5.1)-(A5.5)). 
Analogues  of  such  assumptions  are  also  available  in  the  literature,  an  extensive  survey  of 
which  appears  in  [84].  We  will  focus  on  a  particular  ergodicity  condition  which  not  only 
subsumes  many  such  conditions  but  also  facilitates  easily  implementable  numerical  schemes. 
Our  presentation  here  follows  essentially  that  in  [80,  Chap.  3]. 

(A6.2) There exists a number $\alpha < 1$ such that

$$\sup_{k, k' \in K} \|P(\cdot \mid k) - P(\cdot \mid k')\|_{TV} \le 2\alpha,$$

where $\| \cdot \|_{TV}$ denotes the total variation norm.

Example 6.1. Let $S = \mathbb{R}$, $A \subset \mathbb{R}$, a compact set. Consider the system

$$X_{t+1} = F(X_t, A_t) + G(X_t) W_t, \quad X_0 = x,$$

where $F : \mathbb{R} \times A \to \mathbb{R}$, $G : \mathbb{R} \to \mathbb{R}$ are bounded, continuous and $G(\cdot) > 0$, and $\{W_t\}$ is a sequence of independent $N(0,1)$ random variables ($N(a,b)$ stands for the Gaussian distribution with mean $a$ and variance $b$). In this case the transition kernel is given by

$$P(\cdot \mid x,a) = N(F(x,a), G^2(x)).$$

Using the assumed conditions on $F$, $G$ one can show that (A6.2) holds. We omit the details.
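
A rough numerical illustration of Example 6.1 follows. The particular choices `F(x, a) = sin(x) + a` and `G(x) = 1 + 0.5 cos(x)` are hypothetical, picked only because they are bounded and continuous with `G` bounded away from zero; the supremum in (A6.2) is approximated on a finite grid of state-action pairs, so the value obtained is a lower estimate of the true constant and not a proof.

```python
import numpy as np

# Hypothetical bounded, continuous dynamics (illustrative only).
F = lambda x, a: np.sin(x) + a          # |F| <= 2 for a in [-1, 1]
G = lambda x: 1.0 + 0.5 * np.cos(x)     # 0.5 <= G <= 1.5

xs = np.linspace(-np.pi, np.pi, 21)     # grid of states (F, G are 2*pi-periodic in x)
acts = np.linspace(-1.0, 1.0, 3)        # grid of actions in A = [-1, 1]
y = np.linspace(-12.0, 12.0, 4001)      # integration grid for the next state
dy = y[1] - y[0]

means = np.array([F(x, a) for x in xs for a in acts])
stds = np.array([G(x) for x in xs for a in acts])

# Gaussian densities of P(. | x, a) = N(F(x, a), G(x)^2) on the grid y.
dens = np.exp(-0.5 * ((y - means[:, None]) / stds[:, None]) ** 2)
dens /= stds[:, None] * np.sqrt(2.0 * np.pi)

# Total variation norm as the L1 distance between densities (largest possible value 2).
tv = np.array([np.abs(dens - d).sum(axis=1) * dy for d in dens])
alpha = tv.max() / 2.0
```

For these choices the grid estimate of `alpha` is below one, consistent with (A6.2), although it is fairly close to one because some pairs of kernels overlap very little.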
An  important  consequence  of  (A6.2)  is  given  below;  for  a  proof  and  further  discussion  see 
[80,  Chap.  3]. 
Lemma 6.1. Suppose (A6.2) holds. Then, for any $f \in \Pi_{SD}$, the corresponding process $\{X_t\}$ has a unique invariant measure $\eta(f) \in \mathcal{P}(S)$ satisfying

$$\|P^t(\cdot \mid x, f(x)) - \eta(f)(\cdot)\|_{TV} \le 2\alpha^t, \quad t = 0, 1, \ldots, \tag{6.9}$$

where $P^t(\cdot \mid x, f(x))$ denotes the $t$-step transition probability measure under $f$ with $X_0 = x$.

Remark 6.1. (a) Lemma 6.1 also holds for any $f \in \Pi_{SR}$.

(b) It follows from (6.9) that for any $f \in \Pi_{SD}$, $P^t(\cdot \mid x, f(x))$ converges to $\eta(f)$ in total variation norm, uniformly in $x$ and at a geometric rate.

(c) It is clear that for any $f \in \Pi_{SD}$,

$$J(\mu, f) = \int c(x, f(x)) \, \eta(f)(dx)$$

for any initial law $\mu$.

(d) Compare (6.9) with (A5.5). In view of Theorem 5.4, (A6.2) may be viewed as a representative counterpart of assumptions (A5.1)-(A5.5) for the general state space case.
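
For a finite state space the constant in (A6.2) and the bound (6.9) can be computed directly. The following sketch uses a hypothetical 3-state, 2-action kernel and an arbitrarily chosen stationary deterministic policy `f`; the total variation norm is taken as the $\ell_1$ distance between probability vectors, so its largest possible value is 2.

```python
import numpy as np

# Hypothetical kernel: P[a, x, y] = P(y | x, a).
P = np.array([[[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],
              [[0.1, 0.6, 0.3], [0.5, 0.1, 0.4], [0.6, 0.2, 0.2]]])
nA, nS, _ = P.shape

# alpha from (A6.2): half the largest TV distance between any two rows P(. | x, a).
rows = P.reshape(-1, nS)
alpha = 0.5 * max(np.abs(r1 - r2).sum() for r1 in rows for r2 in rows)

# Chain under a stationary deterministic policy f (chosen arbitrarily here).
f = np.array([0, 1, 0])
Pf = P[f, np.arange(nS)]

# Invariant measure eta(f): left eigenvector of Pf for the eigenvalue 1.
vals, vecs = np.linalg.eig(Pf.T)
eta = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
eta = eta / eta.sum()

# Check (6.9): sup_x ||P^t(. | x) - eta||_TV <= 2 * alpha**t for t = 0, ..., 20.
Pt = np.eye(nS)
bounds_hold = True
for t in range(21):
    tv = np.abs(Pt - eta).sum(axis=1).max()
    bounds_hold = bool(bounds_hold and tv <= 2 * alpha**t + 1e-12)
    Pt = Pt @ Pf
```

With these numbers `alpha` equals 0.6, and the geometric bound holds at every step checked.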

We  now  introduce  the  concept  of  span-contraction. 

Definition 6.1. Let $T : \mathcal{M}_b(S) \to \mathcal{M}_b(S)$. $T$ is said to be a span-contraction if, for some $\gamma \in [0,1)$,

$$\mathrm{span}(Tu - Tv) \le \gamma \, \mathrm{span}(u - v), \quad \text{for all } u, v \in \mathcal{M}_b(S).$$

Let $\sim$ be the equivalence relation on $\mathcal{M}_b(S)$ defined by $u \sim v$ if and only if there exists some constant $C$ such that $u(x) - v(x) = C$ for all $x \in S$. Let $\tilde{\mathcal{M}}_b(S) = \mathcal{M}_b(S)/\sim$, the quotient space, endowed with the quotient norm induced by the span seminorm. For $v \in \mathcal{M}_b(S)$, let $\tilde{v}$ denote the corresponding element of $\tilde{\mathcal{M}}_b(S)$ and $\tilde{T} : \tilde{\mathcal{M}}_b(S) \to \tilde{\mathcal{M}}_b(S)$ be the canonically induced map, i.e., $\tilde{T}\tilde{v} = \widetilde{Tv}$, $v \in \mathcal{M}_b(S)$. It is easily seen that if $T$ is a span-contraction on $\mathcal{M}_b(S)$ then $\tilde{T}$ is a contraction on $\tilde{\mathcal{M}}_b(S)$ and, therefore, has a unique fixed point. In turn, it follows that the map $T$ has a span-fixed point, i.e., there exists a $v^* \in \mathcal{M}_b(S)$ such that $\mathrm{span}(Tv^* - v^*) = 0$ or equivalently, $Tv^* - v^*$ is a constant. It also follows that any two span-fixed points of $T$ must differ by a constant.

We  now  replace  (A2.3)  with  the  following: 

(A6.3) (i) The multifunction $U(x)$ is continuous;

(ii) $c(\cdot,\cdot) \in C_b(K)$.

We have the following result; for a proof see [80, Lemma 3.5].

Lemma 6.2. Under (A2.2), (A6.2) and (A6.3), the operator $T$ defined in (2.5) maps $C_b(S)$ to $C_b(S)$ and is a span-contraction.
Corollary 6.1. Under (A2.2), (A6.2) and (A6.3) the ACOE has a bounded solution $(\rho^*, h^*) \in \mathbb{R} \times C_b(S)$.

Proof. This follows from the fact that there exists an $h^* \in C_b(S)$ such that $\mathrm{span}(Th^* - h^*) = 0$. Hence, $Th^* = h^* + \rho^*$ for some constant $\rho^*$. □

Remark 6.2. (a) Assume (A6.2) and (A6.3). Let $(\rho^*, h^*) \in \mathbb{R} \times C_b(S)$ be a solution to the ACOE and fix $x_0 \in S$. Define $h(\cdot) = h^*(\cdot) - h^*(x_0)$. Then $(\rho^*, h)$ is also a solution to the ACOE. By the span-contraction property of $T$, it is the unique solution in $\mathbb{R} \times C_b(S)$ satisfying $h(x_0) = 0$, i.e., if $(\rho', h') \in \mathbb{R} \times C_b(S)$ is any other solution of the ACOE in $\mathbb{R} \times C_b(S)$ such that $h'(x_0) = 0$, then $\rho' = \rho^*$ and $h' = h$.

(b) In view of the span-contraction property of the operator $T$, the value iteration scheme described in Section 4 can be extended to this case; for details we refer to [80, Chap. 3].
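
A minimal sketch of value iteration with a span-based stopping rule is given below for a finite model; it is not the scheme of Section 4 or of [80], only an illustration of how the span-contraction property is used. The arrays `P` and `c` are hypothetical, and `alpha` is the ergodicity constant of (A6.2) computed over all state-action pairs. Along the way the sketch records the observed ratios $\mathrm{span}(T^2u - Tu)/\mathrm{span}(Tu - u)$, which should not exceed `alpha`.

```python
import numpy as np

# Hypothetical 3-state, 2-action model: P[a, x, y] = P(y | x, a), c[x, a] = cost.
P = np.array([[[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],
              [[0.1, 0.6, 0.3], [0.5, 0.1, 0.4], [0.6, 0.2, 0.2]]])
c = np.array([[1.0, 2.0], [0.5, 1.5], [3.0, 0.2]])

def T(u):
    """(T u)(x) = min_a { c(x, a) + sum_y u(y) P(y | x, a) }."""
    return (c + np.einsum('axy,y->xa', P, u)).min(axis=1)

def span(u):
    return u.max() - u.min()

rows = P.reshape(-1, P.shape[-1])
alpha = 0.5 * max(np.abs(r1 - r2).sum() for r1 in rows for r2 in rows)

u = np.zeros(c.shape[0])
ratios = []
for n in range(1000):
    Tu = T(u)
    diff = Tu - u
    if span(diff) < 1e-8:              # Tu - u is (numerically) a constant
        break
    ratios.append(span(T(Tu) - Tu) / span(diff))
    u = Tu - Tu[0]                     # shifting by a constant keeps the same span class

rho_star = 0.5 * (diff.max() + diff.min())   # the constant T h - h
h_star = u                                    # normalized so that h_star[0] = 0
```

When the loop stops, `(rho_star, h_star)` satisfies the ACOE up to the tolerance, with `h_star` vanishing at the reference state 0 as in part (a) of the remark.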

Remark 6.3. In Section 4, we have identified the duality between the linear programming formulation and the ACOE under the irreducibility assumption. This has been extended by Yamada [199] to the case when the state space $S$ is a compact subset of $\mathbb{R}^n$ and the transition law has a density which satisfies a certain 'positivity' condition. Hernández-Lerma et al. [82] have further extended this result to the Borel state space setting under condition (A6.2).

Kurano [102]-[104] has studied the problem for compact state and action spaces, under the hypothesis of Doeblin. Doeblin's condition for the general state space can be described as follows.

(A6.4) There exists a nontrivial finite measure $\mu$ on $(S, \mathcal{B}(S))$, a positive integer $t$ and an $\varepsilon > 0$ such that

$$P^t(A \mid x, f(x)) > 1 - \varepsilon, \quad \text{if } \mu(A) > \varepsilon,$$

for all $f \in \Pi_{SD}$ and $x \in S$.

Theorem 6.4. Let the state and action spaces be compact and (A6.3), (A6.4) hold. Then there exist an $f \in \Pi_{SD}$ and a set $A \in \mathcal{B}(S)$ with $\mu(A) > \varepsilon$ such that $P(A \mid x, f(x)) = 1$ for all $x \in S$, and $f$ is optimal provided that the initial law is supported on the set $A$.

Further assume the following.

(A6.5) (Reachability). For any $x \in S$ and $D \in \mathcal{B}(S)$ with $\mu(D) > \varepsilon$ ($\mu$ and $\varepsilon$ as in (A6.4)) there exists a $\pi \in \Pi$ such that

$$P_x^\pi\Big(\bigcup_{t=0}^{\infty} \{X_t \in D\}\Big) = 1.$$

(A6.6) One of the following two conditions is satisfied:

(i) $\mu(\partial D) = 0$ if $\mu(D) > 0$, where $\partial D$ denotes the boundary of $D$.

(ii) For each $D \in \mathcal{B}(S)$ with $\mu(D) > \varepsilon$, $P(D \mid x,a)$ is continuous in $(x,a)$.
Theorem 6.5. Under (A6.3)-(A6.6) there exists an $f \in \Pi_{SD}$ which is optimal.

Remark 6.4. (a) The proof of Theorem 6.3 exploits the idea involved in Lemma 5.1 of extracting a stationary randomized policy from a limit point of empirical processes. A novel idea in [102] is to remove the randomization by using the ergodic decomposition of Markov processes under (A6.4). The compactness is used to ensure the tightness of the empirical processes under any policy. This can be dropped if the cost function has a penalizing condition or if there is a blanket stability of Lyapunov type. The details closely mimic the development at the end of Section 5.

(b) Wijngaard [197] has also obtained the existence of an optimal $f \in \Pi_{SD}$ under Doeblin's condition using an operator theoretic method.

We will now discuss the vanishing discount approach to obtain a bounded solution to the ACOE. For a fixed $x_0 \in S$, let $h_\beta(\cdot) = J_\beta(\cdot) - J_\beta(x_0)$ denote the differential discounted value function. For a general state space, the usual diagonalization procedure used on a countable state space is not applicable. Nevertheless, if $h_\beta(\cdot)$ is uniformly bounded and equicontinuous, then one can use a more subtle diagonalization involving the Arzelà-Ascoli theorem to take the required limits and obtain a bounded solution to the ACOE. This was studied by Ross [144]. Following [17], [70], [71], we will discuss some sufficient conditions to obtain the required uniform boundedness and equicontinuity of $h_\beta(\cdot)$.

(A6.7) For each $\beta \in (\beta', 1)$, for some $0 < \beta' < 1$, and $f_\beta \in \Pi_{SD}$, the corresponding state process has a unique invariant probability measure $\eta(f_\beta)$ such that

$$\sup_{x \in S, \ \beta \in (\beta', 1)} \ \sum_{t=0}^{\infty} \|P^t(\cdot \mid x, f_\beta(x)) - \eta(f_\beta)(\cdot)\|_{TV} < \infty. \tag{6.10}$$

The following result is now easy to establish.

Lemma 6.3. Under (A6.1), (A6.3) and (A6.7), $h_\beta(\cdot) := J_\beta(\cdot) - J_\beta(x_0)$, $x_0 \in S$ fixed, is uniformly bounded and equicontinuous for $\beta \in (\beta', 1)$.

Corollary 6.2. Under (A6.1), (A6.3) and (A6.7), the ACOE has a solution $(\rho^*, h)$ such that $h \in C_b(S)$.
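
The vanishing discount limit can be observed numerically on a finite model. In the sketch below (hypothetical numbers; the reference pair `(rho_star, h)` is computed by relative value iteration, and the discounted values by plain value iteration, both choices being assumptions of the sketch) the scaled values $(1-\beta) J_\beta$ approach $\rho^*$ and the differential values $h_\beta = J_\beta - J_\beta(x_0)$, with $x_0 = 0$, approach $h$ as $\beta \uparrow 1$.

```python
import numpy as np

# Hypothetical 3-state, 2-action model: P[a, x, y] = P(y | x, a), c[x, a] = cost.
P = np.array([[[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]],
              [[0.1, 0.6, 0.3], [0.5, 0.1, 0.4], [0.6, 0.2, 0.2]]])
c = np.array([[1.0, 2.0], [0.5, 1.5], [3.0, 0.2]])
nS = c.shape[0]

def q_values(u, beta=1.0):
    return c + beta * np.einsum('axy,y->xa', P, u)

# Reference average-cost solution with h(x0) = 0 for x0 = 0.
h = np.zeros(nS)
for _ in range(500):
    Th = q_values(h).min(axis=1)
    h = Th - Th[0]
rho_star = q_values(h).min(axis=1)[0]

def discounted_value(beta):
    """Value iteration for the beta-discounted problem."""
    J = np.zeros(nS)
    for _ in range(int(40 / (1 - beta))):   # beta**n is then negligible
        J = q_values(J, beta).min(axis=1)
    return J

results = {}
for beta in (0.9, 0.99, 0.999):
    J = discounted_value(beta)
    # (scaled value (1 - beta) J_beta, differential value h_beta)
    results[beta] = ((1 - beta) * J, J - J[0])
```

The gap between `(1 - beta) * J` and `rho_star` stays within `(1 - beta) * span(h)`, in line with the two-sided bound relating $J_\beta$, $h^\pm$ and $\rho/(1-\beta)$ for a canonical triplet.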

Remark 6.5. If (A6.4) is satisfied and we further impose the condition that for every $f \in \Pi_{SD}$, the corresponding state process has a single ergodic class, then (6.10) holds. In particular, if $P(dy \mid x,a)$ has a density $p(y,x,a)$, with respect to some $\sigma$-finite measure $\mu$, and there exists a nonnegative measurable function $p_0$ satisfying $\int p_0(y) \mu(dy) > 0$ and $p(y,x,a) \ge p_0(y)$, for all $(x,a)$, then (A6.4) holds and (6.10) can be easily verified. If $(x,a) \mapsto p(y,x,a)$ is continuous, then by Scheffé's theorem, $P(\cdot \mid x,a)$ is strongly continuous in $(x,a)$.

6.2. Unbounded costs. We now drop the boundedness condition on the cost function and discuss some recent developments involving refinements and extensions of the vanishing discount approach. Since for unbounded costs the uniform boundedness of the differential discounted value function $h_\beta(\cdot)$ is rather unnatural, one attempts to extend the procedure of [151], [152] to the present case. To this end, we make the following analogues of (A5.14)-(A5.16):

(A6.8) There exists a nonnegative function $b \in \mathcal{M}(S)$, a constant $M > 0$, and a sequence $\{\beta_n\} \subset (0,1)$, $\beta_n \uparrow 1$, such that for all $x \in S$,

(i) $-M \le h_{\beta_n}(x) \le b(x)$

(ii) $\int_S b(y) P(dy \mid x,a) < \infty$, for all $a \in U(x)$.

(A6.9) There exists a policy $\pi$ and an initial state $x$ such that $J(x, \pi) < \infty$.

(A6.10) There exists $\beta' \in (0,1)$ such that $\sup_{\beta \in (\beta', 1)} h_\beta(x) < \infty$ where $h_\beta(x) = J_\beta(x) - \inf_{x \in S} J_\beta(x)$.

(A6.11) The transition kernel $P(\cdot \mid x,a)$ is strongly continuous in $a$, for each $x \in S$.

Under (A6.8) and (A6.11), defining $h(x) = \liminf_{n \to \infty} h_{\beta_n}(x)$, $x \in S$, and using Fatou's lemma, one can show that there exists a constant $\rho^*$ such that

$$\lim_{i \to \infty} (1 - \beta_{n_i}) J_{\beta_{n_i}}(x) = \rho^*, \quad \text{for all } x \in S, \tag{6.11}$$

where $\beta_{n_i} \uparrow 1$ is a subsequence of $\{\beta_n\}$, and

$$\rho^* + h(x) \ge \min_{a \in U(x)} \left\{ c(x,a) + \int_S h(y) P(dy \mid x,a) \right\}, \quad x \in S, \tag{6.12}$$

which is the ACOI (see (5.16)) for this case. Similarly, under (A6.9)-(A6.11), one can find a constant $\rho^*$ such that along a suitable sequence $\beta_n \in (\beta', 1)$, $\beta_n \uparrow 1$, $\lim_{n \to \infty} (1 - \beta_n) \inf_{x \in S} J_{\beta_n}(x) = \rho^*$. Then defining $h(x) = \liminf_{n \to \infty} h_{\beta_n}(x)$ one can deduce (6.12). Thus, we have the following result.

Theorem 6.6. Under (A6.8) and (A6.11) or under (A6.9)-(A6.11), there exists a constant $\rho^*$ and a function $h$ which is bounded below and satisfies (6.12). Any policy $\pi \in \Pi_{SD}$ realizing the minimum on the right-hand side of (6.12) is average optimal and $\rho^*$ is the minimum average cost.

Remark  6.6.  For  details,  we  refer  to  [81],  [83],  [136].  In  the  case  of  a  countable  state  space  a 
number  of  sufficient  conditions  on  the  transition  kernel  and  the  cost  function  which  enable 
us  to  verify  (A5.14)-(A5.16)  are  available,  as  mentioned  in  Section  5.  This  does  not  seem 
to  be  the  case  for  a  general  Borel  state  space  model,  although  several  interesting  examples 
have  been  studied  in  [81],  [83]  and  [136].  Also,  (A6.ll)  is  a  very  strong  condition  and  will 
not,  in  general,  be  satisfied  for  the  transition  kernel  of  the  equivalent  problem  for  a  partially 
observable  model.  Thus,  this  case  needs  further  investigation.  Finally,  note  that  (A6.10) 
may  in  principle  be  easier  to  verify  than  (A6.8). 

Remark 6.7. We note that Theorem 6.6 provides only an ACOI and not the ACOE. In many
situations, the discounted value function is convex (e.g., in linear systems with quadratic
cost [14]), or concave (e.g., the separated problem in partially observable models). This
class of problems has been used in [59] to obtain the ACOE under (A6.8), (A6.11) and
some additional assumptions.
§7. Partially Observable Controlled Markov Processes

Thus far, we have assumed that the complete history of the process $H_t$ is available to the
decision-maker at each stage $t \in T$. However, in many situations some components of the
state  process  may  not  be  directly  available  to  the  controller  since,  e.g.  it  may  be  impossible  or 
too  costly  to  measure  these.  Furthermore,  due  to  imprecisions  in  the  measuring  devices,  only 
noisy  observations  of  the  state  may  be  available.  When  these  situations  arise,  the  problem 
is  said  to  be  a  partially  observable  controlled  Markov  process.  We  study  here  POCMP  with 
finite  or  countably  infinite  state  and  observation  spaces,  and  finite  or  compact  action  set.  A 
major  portion  of  our  exposition  concentrates  on  the  vanishing  discount  method,  where  we 
see  that  the  particular  structure  of  the  POCMP  can  be  employed  to  yield  stronger  results 
than  those  available  for  general  Borel  spaces.  We  also  review  Borkar’s  convex  analytic 
approach,  specialized  to  the  partially  observable  case  [26]. 

7.1.  Models  with  partial  state  information. 

The model for this problem is essentially that in [50, Chap. 8] and is as follows. The state
process is described by a pair $\{X_t, Y_t\}_{t \in T}$ taking values in a product of Borel spaces
$X \times Y$. Only the second component $\{Y_t\}_{t \in T}$ of the state process is available for
decision-making and, reflecting this, $Y$ is called the observation or message space and $Y_t$
the observation process. With $A$ denoting the action space, the evolution of the system is
governed by a measurable stochastic kernel $P$ on $X \times Y$ given $X \times Y \times A$.

Let $p \in \mathcal{P}(X \times Y)$ be an initial distribution of the state. Decomposing
(disintegrating) the measure $p$, we have

$p(dx,dy) = Q_0(dy)\, \psi_0(dx \mid y)$,

where $Q_0$ is the marginal of $p$ on $Y$ and $\psi_0$ is a version of the regular conditional law,
defined $Q_0$-a.s.; we pick any version from this equivalence class and keep it fixed thereafter.
Note that, since the value of $Y_0$ is available to the controller, knowledge of $p$ implies that
an a posteriori distribution $\psi_0$ (given $Y_0 = y$) for the unobserved initial state is introduced.
We include $\psi_0$ in the observed history by letting

$H_0 := \mathcal{P}(X) \times Y$, $H_t := H_{t-1} \times Y \times A$, $t \in \mathbb{N}_0$.

The set of admissible actions is specified by a strict, measurable, compact-valued
multifunction $U : Y \to \mathcal{B}(A)$. Hence, in this context, an admissible policy is a sequence
$\pi = \{\pi_t\}_{t \in T}$ of Borel measurable stochastic kernels $\pi_t$ on $A$ given $H_t$ satisfying,
for all $t \in T$, the constraint

$\pi_t(U(y_t) \mid h_t) = 1$, $\forall h_t \in H_t$.

The set of all admissible policies is again denoted by $\Pi$.

Remark 7.1. In general, decisions take into account past and present information, and not
just the last observation. Notice that the constraints on the actions cannot depend on the
unobservable component $X_t$ of the state. If this type of constraint must be included in the
model, then it must be provided to the controller as an additional observation. Similarly, if
the cost process $\{c(X_t, Y_t, A_t)\}$ is available to the controller, then it should also be regarded
as an additional component in the observation process [50, p. 201].

Remark 7.2. Quite often $p$ is specified as

$p(dx,dy) = Q_0(dy \mid x)\, p_0(dx)$,

where $p_0 \in \mathcal{P}(X)$ is an initial distribution for $X_0$, and $Q_0$ is a stochastic kernel on $Y$
given $X$ [15, Chap. 10], [80, Chap. 4].

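With $p \in \mathcal{P}(X \times Y)$ and an admissible policy $\pi$ specified, there exists a unique
probability measure $\mathcal{P}_p^{\pi}$ on $(\Omega, \mathcal{B}(\Omega))$, where
$\Omega := (X \times Y \times A)^{\infty}$, defined by

$\mathcal{P}_p^{\pi}(dx_0, dy_0, da_0, \ldots, da_{t-1}, dx_t, dy_t) =
p(dx_0, dy_0)\, \pi_0(da_0 \mid \psi_0, y_0)\, P(dx_1, dy_1 \mid x_0, y_0, a_0)
\cdots \pi_{t-1}(da_{t-1} \mid \psi_0, y_0, a_0, \ldots, y_{t-1})\, P(dx_t, dy_t \mid x_{t-1}, y_{t-1}, a_{t-1})$.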

7.2.  Transformation  into  a  completely  observable  model.  A  common  approach  in 
the  analysis  of  a  partially  observable  (PO)  model  is  to  construct  a  completely  observable 
(CO)  model,  equivalent  to  the  original  one  in  the  sense  that  corresponding  policies  have 
equal  costs.  The  advantages  in  doing  this  are  obvious,  since  the  theory  of  CO  problems  is 
much  better  developed.  However,  the  price  usually  paid  is  that  the  dimensionality  of  the 
new  state  space  is  substantially  larger  than  that  of  the  original  one. 

Such  an  equivalent  CO  problem  can  be  obtained  in  many  ways.  The  main  idea  is  to 
specify  an  information  state  process  that  summarizes,  at  each  time,  all  relevant  information 
for decision-making. Clearly, $H_t = (\psi_0, Y_0, A_0, \ldots, A_{t-1}, Y_t)$ can be used as an
information state process, but this leads to a nonstationary CO model, in which "growing memory"
difficulties arise; see [15, Chap. 10]. We present here the more standard approach where
the inferential knowledge of $X_t$ is summarized using its conditional probability distribution,
given the entire observed history up to time $t$. We first present the construction of the
equivalent  CO  model  for  general  Borel  state  spaces  and  then  specialize  to  models  with 
countable  state  space.  Also,  the  following  assumption  will  be  in  effect  throughout  this 
Section: 

(A7.1) The transition kernel $P(\cdot \mid x,y,a)$ and the cost function $c(x,y,a)$ do not depend on
$y$, and $U(y) = A$, for all $y \in Y$.

Given a PO model $(X \times Y, A, U, P, c)$, satisfying (A7.1), we construct a CO model
$(\mathcal{P}(X), A, \bar U, \mathcal{K}, \bar c)$ as follows. Let $\{\Psi_t\}_{t \in T}$ and
$\{\bar H_t\}_{t \in T}$ denote the state process and the history spaces respectively. The set of
admissible actions is selected by letting $\bar U(\psi) = A$, for all $\psi \in \mathcal{P}(X)$. We
define the cost function $\bar c$ by

(7.1)  $\bar c(\psi, a) := \int_X c(x,a)\, \psi(dx)$, $\psi \in \mathcal{P}(X)$.

It remains to construct the transition kernel $\mathcal{K}$. Working on the canonical sample space
$\bar\Omega = (\mathcal{P}(X) \times A)^{\infty}$ we first define a stochastic kernel $q$ on $X \times Y$
given $\mathcal{P}(X) \times A$, by

(7.2)  $q(dx, dy \mid \psi, a) := \int_X P(dx, dy \mid x', a)\, \psi(dx')$, $\psi \in \mathcal{P}(X)$,

and decomposing $q$, we obtain

(7.3)  $q(dx, dy \mid \psi, a) = Q(dy \mid \psi, a)\, \Psi(dx \mid \psi, a, y)$.

Equation (7.3) is the filtering equation. For fixed $(\psi, a)$, the map $y \mapsto \Psi$, as defined
implicitly in (7.3), is a measurable mapping from $Y$ to $\mathcal{P}(X)$. Consequently, along with
the distribution $Q$ on $Y$, it induces a distribution $\mathcal{K}$ on $\mathcal{B}(\mathcal{P}(X))$ which is
a measurable function of $(\psi, a)$, or, in other words, a stochastic kernel on $\mathcal{P}(X)$ given
$\mathcal{P}(X) \times A$. It follows that the model $(\mathcal{P}(X), A, \mathcal{K}, \bar c)$, with state
process $\{\Psi_t\}_{t \in T}$, forms a completely observable controlled Markov process, with
transition kernel given by

(7.4)  $\mathcal{K}(B \mid \psi, a) := \int_Y I\{\Psi(\cdot \mid \psi, a, y) \in B\}\, Q(dy \mid \psi, a)$, $B \in \mathcal{B}(\mathcal{P}(X))$.

The distribution $\bar p$ of $\Psi_0$, corresponding to an initial distribution $p$ of the PO model, is
taken to be

(7.5)  $\bar p(B) := \int_Y p(B, dy)$, $B \in \mathcal{B}(\mathcal{P}(X))$.

Given a history $h_t = (\psi_0, y_0, \ldots, y_t) \in H_t$ in the PO model we can construct
$\psi_1, \psi_2, \ldots$ in a recursive manner by starting from $\psi_0$ and, having obtained
$\psi_{t-1}$, solving for $\Psi$ in (7.3), with $(\psi, a, y) = (\psi_{t-1}, a_{t-1}, y_t)$, and letting
$\psi_t = \Psi$. In this manner we obtain a corresponding history
$\bar h_t = (\psi_0, a_0, \ldots, a_{t-1}, \psi_t) \in \bar H_t$ for the CO model; we denote this
correspondence by the map $g_t : H_t \to \bar H_t$. We can then assign to each admissible policy
$\bar\pi \in \bar\Pi$ in the CO model a corresponding policy $\pi = g^*(\bar\pi)$ in the PO model,
defined by

(7.6)  $\pi_t(\cdot \mid h_t) := \bar\pi_t(\cdot \mid g_t(h_t))$, $h_t \in H_t$.

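Clearly every policy $\bar\pi \in \bar\Pi$ can also be regarded as a policy in $\Pi$; in other words,
the map $g^*$ is onto. If $\bar{\mathcal{P}}_{\bar p}^{\bar\pi}$ is the probability measure induced by the
policy $\bar\pi$ and the initial distribution $\bar p$ (corresponding to $p$) on the canonical sample
space $\bar\Omega$, then for each $C \in \mathcal{B}(X)$

(7.7)  $\mathcal{P}_p^{g^*(\bar\pi)}(X_t \in C \mid H_t = h_t) = \psi_t(C)$, $\mathcal{P}_p^{g^*(\bar\pi)}$-a.s.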

Utilizing (7.1), (7.4) and (7.5), it can be verified that

(7.8)  $E_p^{g^*(\bar\pi)}\{c(X_t, A_t)\} = \bar E_{\bar p}^{\bar\pi}\{\bar c(\Psi_t, A_t)\}$, $\forall t \in T$,

thus establishing that the two models are indeed equivalent as claimed. It follows that
the process $\Psi_t$ summarizes all information relevant for control purposes, and is called for
this reason a sufficient statistic (see [157], [158] and [49]). We define the set of separated
policies $\Pi_S$ as those policies $\pi \in \Pi$ for which there is a Markov policy $\bar\pi$ on the
equivalent CO problem such that $\pi = g^*(\bar\pi)$, as defined in (7.6). In other words, with
$f = \{f_t\}_{t \in T} \in \bar\Pi_M$, $f_t : \mathcal{P}(X) \to \mathcal{P}(A)$ and for each initial
distribution $p \in \mathcal{P}(S)$,

$\pi_t(\cdot \mid H_t) = f_t(\Psi_t)(\cdot)$, $\mathcal{P}_p^{\pi}$-a.s.
Thus, the actions taken using a separated policy only depend on $H_t$ through the conditional
distribution of $X_t$. In other words, the following separation principle holds: if an optimal
policy exists in $\Pi$, one exists in $\Pi_S$. Hence, the process can be controlled optimally by
first  estimating  the  state  via  the  conditional  distribution,  and  choosing  control  actions 
based  solely  on  the  latter.  These  and  other  results,  in  various  degrees  of  generality,  were 
independently  obtained  by  various  authors,  e.g.  [3],  [5],  [87],  [134],  [159],  [147],  [170],  [171], 
[195],  [201]. 

Example 7.1. A partially observable version of the stochastic nonlinear system in
Example 2.1 is described by the equations

$X_{t+1} = F(X_t, A_t, W_t)$,

$Y_t = G(X_t, A_{t-1}, V_t)$,

$Y_0 = G_0(X_0, V_0)$,

where $G$ and $G_0$ are Borel measurable, and the disturbance $\{V_t\}_{t \in T}$ is an i.i.d. sequence
of random variables taking values in a Borel space $V$, with a common distribution $\mathcal{P}_V$;
furthermore, it is assumed that $X_0$, $\{W_t\}$ and $\{V_t\}$ are mutually independent.

We now specialize to the case where the state space $X \times Y$ is a finite or countably infinite
set, the action space $A$ is a finite or compact set, and assumption (A7.1) is in effect. Thus,
$U(y) = A$, for all $y \in Y$, and the kernel of the process takes the form $P(x', y' \mid x, a)$. We
also assume that the cost $c$ and the kernel $P$ are continuous with respect to $a \in A$. The
space $\mathcal{P}(X)$ is identified with the set $\Delta$ of probability vectors, i.e.,

(7.9)  $\Delta := \left\{ \psi \in [0,1]^X : \sum_{x \in X} \psi(x) = 1 \right\}$,

endowed with the topology given by the metric

$d(\psi_1, \psi_2) := \sum_{x \in X} |\psi_1(x) - \psi_2(x)| = \|\psi_1 - \psi_2\|_1$,

where $\|\cdot\|_1$ stands for the standard $\ell_1$-norm on $\mathbb{R}^X$.

In general, the recursive (filtering) equation (7.3) used to compute $\psi_{t+1}$ is obtained via a
decomposition  of  measures  technique,  see  [15,  Chap.  10],  [50,  Chap.  8],  [80,  Chap.  4],  [201]. 
This  is  particularly  simple  to  accomplish  (using  Bayes  rule)  when  X  and  Y  are  countable,  or 
when  the  system  is  described  by  a  linear  system  function  and  the  disturbances  are  Gaussian, 
see  [5],  [14],  [100],  [170],  [171].  For  this  purpose,  we  need  the  following  definitions  (compare 
with  (7.2)-(7.3)). 


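(7.10)  $q(x, y \mid \psi, a) := \sum_{x' \in X} P(x, y \mid x', a)\, \psi(x')$

(7.11)  $V(y, \psi, a) := \sum_{x \in X} q(x, y \mid \psi, a)$

(7.12)  $T(y, \psi, a) := \begin{cases} q(\cdot, y \mid \psi, a) / V(y, \psi, a), & \text{if } V(y, \psi, a) \neq 0 \\ 0, & \text{otherwise} \end{cases}$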
Note that the map $\psi \mapsto T(y, \psi, a)$ maps $\Delta$ into itself. In the countable case $\psi_t$ can be
computed by letting $\psi_t = T(y_t, \psi_{t-1}, a_{t-1})$. Here, $V(y, \psi, a)$ is interpreted as the
(one-step ahead) conditional probability of the observation being $y$ given an a priori
distribution $\psi$ for the core state, under decision $a$. Likewise, $T(y, \psi, a)$ is interpreted as the
a posteriori conditional probability distribution of the core state given that decision $a$ was made,
observation $y$ obtained, and an a priori distribution $\psi$. Also, the kernel in (7.4) takes the form

(7.13)  $\mathcal{K}(B \mid \psi, a) := \sum_{y \in Y} V(y, \psi, a)\, I\{T(y, \psi, a) \in B\}$, $B \in \mathcal{B}(\Delta)$,

while the cost $\bar c$ is computed by

(7.14)  $\bar c(\psi, a) := \sum_{x \in X} c(x,a)\, \psi(x)$.
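
The definitions (7.10)-(7.12) can be traced on a small example. The following is a minimal
sketch only: the two-state machine, its action names ("run", "replace"), its observation
labels and all numerical values are hypothetical and are not taken from the text. The joint
kernel is stored as P[a][x'][(x, y)], and exact fractions are used so that the numbers can be
checked by hand.

```python
from fractions import Fraction as F


def joint(P, psi, a):
    """(7.10): q(x, y | psi, a) = sum over x' of P(x, y | x', a) * psi(x')."""
    q = {}
    for x_prev, weight in psi.items():
        for (x, y), prob in P[a][x_prev].items():
            q[(x, y)] = q.get((x, y), 0) + prob * weight
    return q


def obs_prob(P, y, psi, a):
    """(7.11): V(y, psi, a) = sum over x of q(x, y | psi, a)."""
    return sum(prob for (_, obs), prob in joint(P, psi, a).items() if obs == y)


def posterior(P, y, psi, a):
    """(7.12): T(y, psi, a). Returns None in the case V = 0, where (7.12) sets T to 0."""
    q = joint(P, psi, a)
    v = sum(prob for (_, obs), prob in q.items() if obs == y)
    if v == 0:
        return None
    return {x: prob / v for (x, obs), prob in q.items() if obs == y}


# Hypothetical data: state 0 = "good", state 1 = "bad" (absorbing under "run").
P = {
    "run": {
        0: {(0, "ok"): F(18, 25), (0, "alarm"): F(9, 50),
            (1, "ok"): F(3, 100), (1, "alarm"): F(7, 100)},
        1: {(1, "ok"): F(3, 10), (1, "alarm"): F(7, 10)},
    },
    "replace": {
        0: {(0, "ok"): F(4, 5), (0, "alarm"): F(1, 5)},
        1: {(0, "ok"): F(4, 5), (0, "alarm"): F(1, 5)},
    },
}

psi = {0: F(1, 2), 1: F(1, 2)}                  # a priori distribution
v_ok = obs_prob(P, "ok", psi, "run")            # one-step-ahead probability of "ok"
psi_next = posterior(P, "ok", psi, "run")       # a posteriori distribution
```

In this toy model the action "replace" sends every belief to the point mass at state 0,
whatever the observation, so the support of the kernel in (7.13) under "replace" is a single
point.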


Remark 7.3. It is common to specify, instead of the kernel $P$, a transition kernel $P$ on $X$
given $X \times A$ and an observation kernel $Q$ on $Y$ given $X \times A$ [14], [61], [80], [124], [166].
Note that this is only a special case of our presentation, which happens when the kernel $P$
admits the decomposition

$P(x, y \mid x', a) = Q(y \mid x, a)\, P(x \mid x', a)$.

In this case, we can express (7.10)-(7.12) in a convenient vector form by viewing $\psi$ as
an element of $\mathbb{R}^X$ and defining the transition matrix $[P(a)]_{x x'} := P(x \mid x', a)$ and the
observation matrix $Q_y(a) := \mathrm{diag}\{Q(y \mid x, a) : x \in X\}$. Then with $q$ denoting the vector
in $\mathbb{R}^X$ defined by $q_x(y \mid \psi, a) := q(x, y \mid \psi, a)$ and $\mathbf{1}' = (1, \ldots, 1)$, we have

(7.10')  $q(y \mid \psi, a) = Q_y(a)\, P(a)\, \psi$

(7.11')  $V(y, \psi, a) = \mathbf{1}'\, Q_y(a)\, P(a)\, \psi$

and analogously for (7.12).
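
The vector form can be sketched in a few lines. This is an illustration under stated
assumptions: $X$ is taken to be finite, the matrix P_run and the diagonal Q_ok are hypothetical
numbers, and plain lists are used instead of a linear algebra library. The convention is
P_a[x][x'] = P(x | x', a), so that P(a) acts on the belief from the left.

```python
from fractions import Fraction as F


def filter_step(P_a, Qy_diag, psi):
    """One filtering step in the vector form of Remark 7.3.

    P_a[x][xp] plays the role of P(x | x', a), Qy_diag[x] that of Q(y | x, a).
    Returns (q, V, T) as in (7.10'), (7.11') and the analogue of (7.12).
    """
    n = len(psi)
    # P(a) psi: one-step prediction of the core state.
    predicted = [sum(P_a[x][xp] * psi[xp] for xp in range(n)) for x in range(n)]
    # Q_y(a) is diagonal, so the product reduces to a componentwise scaling.
    q = [Qy_diag[x] * predicted[x] for x in range(n)]
    v = sum(q)                                   # 1' Q_y(a) P(a) psi
    post = [qx / v for qx in q] if v != 0 else None
    return q, v, post


# Hypothetical two-state data (columns of P_run sum to one).
P_run = [[F(9, 10), F(0)],
         [F(1, 10), F(1)]]
Q_ok = [F(4, 5), F(3, 10)]
psi = [F(1, 2), F(1, 2)]

q, v, post = filter_step(P_run, Q_ok, psi)
```

Because $Q_y(a)$ is diagonal, the observation update is a componentwise reweighting of the
predicted distribution followed by normalization, which is Bayes' rule.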

Note that a nonrandomized separated admissible policy can be viewed as a sequence of
maps $\pi_t : \Delta \to A$. Then an equivalent, completely observable, discounted cost problem
(DC$_\beta$) can be formulated as finding a separated admissible policy which minimizes

$J_\beta^{\pi}(\psi_0) := E_{\psi_0}^{\pi} \left[ \sum_{t=0}^{\infty} \beta^t\, \bar c(\Psi_t, A_t) \right]$.

The  average  cost  problem  (ACr)  is  analogously  defined. 

Note that the one-stage cost function $\bar c(\psi, a)$ is linear in $\psi \in \Delta$. It is easy to show that
the expectation operator corresponding to the kernel $\mathcal{K}$ preserves concavity (convexity) [49],
[6]. The following results complement those in Theorem 2.1.
Theorem 7.1. For a (DC$_\beta$) decision problem, $J_\beta(\cdot)$ is a concave function, for all $0 < \beta < 1$.
The DCOE is given by

(7.15)  $J_\beta(\psi) = \min_{a \in A} \left\{ \bar c(\psi, a) + \beta \sum_{y \in Y} V(y, \psi, a)\, J_\beta(T(y, \psi, a)) \right\}$,

and any (nonrandomized) separated stationary policy which attains the minimum above is
optimal.
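
The right-hand side of (7.15) defines an operator on functions of the belief. The sketch below
applies that operator repeatedly, starting from the zero function, on a hypothetical two-state
model (all data are illustrative and not from the text). The iterates are only finite-horizon
approximations of $J_\beta$; the recursion is exponential in the number of stages and is meant
for inspection, not computation.

```python
from fractions import Fraction as F

STATES = (0, 1)
ACTIONS = ("run", "replace")
OBS = ("ok", "alarm")
# Hypothetical model: P[a][x][xp] = P(x | x', a), Q[y][x] = Q(y | x), C[a][x] = c(x, a).
P = {"run": [[F(9, 10), F(0)], [F(1, 10), F(1)]],
     "replace": [[F(1), F(1)], [F(0), F(0)]]}
Q = {"ok": [F(4, 5), F(3, 10)], "alarm": [F(1, 5), F(7, 10)]}
C = {"run": [F(0), F(2)], "replace": [F(1), F(1)]}


def step(psi, a, y):
    """Return (V(y, psi, a), T(y, psi, a)); the posterior is None when V = 0."""
    pred = [sum(P[a][x][xp] * psi[xp] for xp in STATES) for x in STATES]
    q = [Q[y][x] * pred[x] for x in STATES]
    v = sum(q)
    return v, (tuple(qx / v for qx in q) if v else None)


def backup(J, beta):
    """Map J to the function given by the right-hand side of (7.15)."""
    def TJ(psi):
        values = []
        for a in ACTIONS:
            total = sum(C[a][x] * psi[x] for x in STATES)   # c-bar(psi, a), cf. (7.14)
            for y in OBS:
                v, post = step(psi, a, y)
                if v:                                        # V = 0 terms contribute nothing
                    total += beta * v * J(post)
            values.append(total)
        return min(values)
    return TJ


def iterate(n, beta):
    """n-fold application of the backup to the zero function."""
    J = lambda psi: F(0)
    for _ in range(n):
        J = backup(J, beta)
    return J


beta = F(9, 10)
J3 = iterate(3, beta)
e0, e1, mid = (F(1), F(0)), (F(0), F(1)), (F(1, 2), F(1, 2))
```

Each iterate is a minimum of finitely many functions linear in $\psi$, so it is concave; the
midpoint inequality $J_3(\text{mid}) \ge (J_3(e^0) + J_3(e^1))/2$ can be checked directly on
this example.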

Remark 7.4. The optimality equation (7.15) is obtained from the general theory of CMP
[15], [80]. For other results, see [5]-[7], [14], [49], [124], [157], [166], [167], [165].

In this context, a pair $(\rho, h)$ is said to be a solution to the ACOE if, for all $\psi \in \Delta$,

(7.16)  $\rho + h(\psi) = \min_{a \in A} \left\{ \bar c(\psi, a) + \sum_{y \in Y} V(y, \psi, a)\, h(T(y, \psi, a)) \right\}$.

7.3. The vanishing discount approach. As shown in Section 5, for a countable state
space CMP, boundedness conditions on the differential discounted value function were
sufficient for solutions to the corresponding ACOE to exist. We consider here the following
hypothesis:

(A7.2) There exists a sequence $\beta_n \uparrow 1$, such that $h_{\beta_n}$ is bounded.

Despite the fact that the model $(\Delta, A, \mathcal{K}, \bar c)$ has a general Borel state space, it has two
special features which simplify the analysis via the vanishing discount method. The first of
these features is the concavity of the discounted value function, while the second is the fact
that the kernel $\mathcal{K}(\cdot \mid \psi, a)$ vanishes on the complement of a countable set (for fixed $\psi$ and
$a$) and, thus, the integrals with respect to $\mathcal{K}$ reduce to infinite sums.

For the finite state and action space case, the concavity of the discounted value function
has been exploited by Platzman [132] and by Ohnishi et al. [128]. These authors utilize
the fact that a collection of concave functions defined on some relatively open convex set
$C$ which are finite and pointwise bounded is uniformly bounded and equi-Lipschitzian
relative to any closed subset of $C$ [139, Th. 10.6]. Thus, under condition (A7.2), the finite
dimensionality of $\Delta$ and the concavity of $h_\beta(\cdot)$ are used in [128], [132] to obtain a bounded
solution $(\rho^*, h)$ to the ACOE, via the vanishing discount approach. In particular, they
partition $\Delta$ into its interior, its vertices, and its edges, i.e.,

$\Delta = \bigcup_{j \in J} \Delta_j$.

Note that $|J| = 2^{|X|+1} - 1$ and that each set $\Delta_j$ is a relatively open convex set. Given a
sequence $\beta_n \uparrow 1$, the concavity of $h_\beta(\cdot)$ and (A7.2) are used to obtain subsequences
$\beta_{n(j)}$ such that $\{h_{\beta_{n(j)}}(\cdot)\}$ converges on $\Delta_j$. Platzman [132] defines a metric on
$\Delta$ which accomplishes this partition. Let

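$I(\psi) := \{ i \in X : \psi(i) > 0 \}$, $\psi \in \Delta$,

$d(\psi_1, \psi_2) := 1 - \min\{ \psi_1(i)/\psi_2(i) : i \in I(\psi_2) \}$, $\psi_1, \psi_2 \in \Delta$,

$D(\psi_1, \psi_2) := \max\{ d(\psi_1, \psi_2),\, d(\psi_2, \psi_1) \}$.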
In [131, pp. 88-89], Platzman shows that $D(\cdot, \cdot)$ is a metric which leaves $\Delta$ disconnected
and with components identical to the elements of the partition $\{\Delta_j\}_{j \in J}$. The following is
shown in [132, Lemma A.1].

Lemma 7.1. Let $f : \Delta \to \mathbb{R}$ be concave and bounded below; then

$|f(\psi_1) - f(\psi_2)| \le \mathrm{span}(f)\, D(\psi_1, \psi_2)$.


Hence, under (A7.2), $\{h_\beta(\cdot)\}_{\beta \in (0,1)}$ is an equi-Lipschitzian family, with common Lipschitz
constant given by the (smallest) uniform bound, and the Arzela-Ascoli Theorem can be used
as in [144] to obtain a bounded solution to the ACOE.
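
A small numerical sketch of the Platzman metric may help. It assumes the reading
$d(\psi_1, \psi_2) = 1 - \min\{\psi_1(i)/\psi_2(i) : i \in I(\psi_2)\}$ of the definition above,
a finite state space, and a hypothetical concave test function $f$ (a minimum of two linear
functions) whose span on $\Delta$ equals 1; none of these numbers come from the text.

```python
from fractions import Fraction as F


def d(psi1, psi2):
    """One-sided distance: 1 - min over the support of psi2 of psi1(i) / psi2(i)."""
    return 1 - min(p1 / p2 for p1, p2 in zip(psi1, psi2) if p2 > 0)


def D(psi1, psi2):
    """Symmetrized distance D = max{d(psi1, psi2), d(psi2, psi1)}."""
    return max(d(psi1, psi2), d(psi2, psi1))


def f(psi):
    """Hypothetical concave function on the 3-point simplex; inf f = 0, sup f = 1."""
    c1, c2 = (0, 1, 2), (2, 1, 0)
    return min(sum(c * p for c, p in zip(c1, psi)),
               sum(c * p for c, p in zip(c2, psi)))


psi1 = (F(1, 2), F(3, 10), F(1, 5))   # interior point
psi2 = (F(2, 5), F(2, 5), F(1, 5))    # interior point, same support as psi1
e0 = (F(1), F(0), F(0))               # vertex, different support

dist_same_face = D(psi1, psi2)
dist_other_face = D(psi1, e0)
lipschitz_gap = abs(f(psi1) - f(psi2))
```

Two beliefs with the same support are at distance strictly less than 1, while beliefs with
different supports are at distance exactly 1; this is how the metric separates the faces of
the simplex. On the example, the bound of Lemma 7.1 reads $1/10 \le 1 \cdot 1/4$.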

If the state space is infinite the above method does not work, simply because the
partition induced by the Platzman metric results in a nonseparable space. In this situation
the particular structure of the kernel has been employed in [61] to develop a theoretical
framework based on the notion of invariant subsets (sub-processes) of a CMP and sufficient
conditions are given for the existence of solutions to the ACOE, in the case of a finite action
space. The key point is to note that if we let $B(\psi, a) := \{T(y, \psi, a) : y \in Y\}$, which is a
countable set since $Y$ is countable, then $\mathcal{K}[B(\psi, a) \mid \psi, a] = 1$. Thus, at any time
$t \in \mathbb{N}_0$, the set of possible next states for $\Psi_t$ is the set $\bigcup_{a \in A} B(\Psi_t, a)$, which is
countable, provided $A$ is finite. This special structure has also been identified by other
authors, e.g. [5, p. 187], [132, p. 369], [166, pp. 19-20].

We briefly summarize the work in [61]. The notions of descendents, ancestors and
relatives of a point $\psi \in \Delta$ are first introduced. The descendents of $\psi$ are defined as the
smallest subset of $\Delta$ containing $\psi$ which is invariant under the action of the maps in the
collection $\{T(y, \cdot, a) : y \in Y, a \in A\}$, while the ancestors of $\psi$ are defined as all the points
in $\Delta$ which reach $\psi$ under the application of a finite sequence of these maps. Finally, the
relatives of a point $\psi$, denoted by $\mathcal{R}_\psi^{(1)}$, is the set formed by the union of its descendents
and ancestors. Note that the definition of the descendents is an extension, to the present
context, of Doob's concept of consequent sets [44, p. 206]. Subsequently, the genealogical
tree $GT_\psi$ of $\psi$ is defined by

$GT_\psi := \bigcup_{n \in \mathbb{N}} \mathcal{R}_\psi^{(n)}$,

where the sets $\mathcal{R}_\psi^{(n)}$ are defined recursively as

$\mathcal{R}_\psi^{(n+1)} := \bigcup_{s \in \mathcal{R}_\psi^{(n)}} \mathcal{R}_s^{(1)}$, $n \in \mathbb{N}$.

The descendents of a point form a countable set, but the ancestors can in general be
uncountably many. To guarantee that the relatives and hence the genealogical tree of a
point is a countable set the following condition is introduced.

(A7.3) For all $y \in Y$, $a \in A$, and $\psi \in \Delta$, $T^{-1}(y, \psi, a)$ is a countable set.

Introduce the relation $\psi \sim \psi'$ if $GT_\psi = GT_{\psi'}$. It follows that $\sim$ defines an equivalence
relation on $\Delta$ resulting in a partition of $\Delta$ into equivalence classes which are precisely
the sets $GT_\psi$. Under conditions (A7.2)-(A7.3) the standard diagonalization argument can
be employed on each equivalence class $GT_\psi$ to construct a pair $(\rho^*, h_{GT_\psi})$ which solves the
ACOE on $GT_\psi$ (the boundedness hypothesis (A7.2) can be weakened by letting the constant
$M$ depend on the equivalence class). Then, by defining $h(\psi) := h_{GT_\psi}(\psi)$ for all $\psi \in \Delta$,
$(\rho^*, h)$ clearly solves the ACOE on $\Delta$. One peculiarity of this approach is that the resulting
function $h$ is not guaranteed to be measurable. This is not a major problem though, since
an important consequence of the particular structure (with finite action space) is that the
"measurability of various objects is of no essential concern" for the equivalent problem [15,
p. xi]. The approach in [61] fails when the action space $A$ is not finite.

Since  the  vanishing  discount  method  relies  heavily  on  the  boundedness  of  the  differential 
discounted  value  function,  the  problem  of  finding  sufficient  conditions  on  the  cost  and  the 
kernel  of  the  process  for  this  to  hold  becomes  an  important  one.  Platzman  [132]  has  given 
(reachability  and  detectability)  conditions  for  (A7.2)  to  hold;  however,  these  conditions  are 
difficult  to  verify.  On  the  other  hand,  many  models  of  interest  possess  special  properties, 
which  allow  the  verification  of  condition  (A7.2)  very  easily. 

Suppose that a partial order "$\preceq_\Delta$" has been defined on $\Delta$, and let "$\le_A$" denote a linear
order on $A$; we assume that $A$ is finite. We also identify $X$ with $\mathbb{N}_0$ and endow it with its
natural ordering.

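Definition 7.1. Consider $((\Delta, \preceq_\Delta), (A, \le_A), \mathcal{K}, \bar c)$, and let $\psi_1, \psi_2 \in \Delta$. We say that:

(i) the value functions are monotone if

$\psi_1 \preceq_\Delta \psi_2 \;\Rightarrow\; J_\beta(\psi_1) \le J_\beta(\psi_2)$, for all $0 < \beta < 1$,

(ii) a (nonrandomized) stationary separated policy $\pi$ is monotone if

$\psi_1 \preceq_\Delta \psi_2 \;\Rightarrow\; \pi(\psi_1) \le_A \pi(\psi_2)$.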

Two frequently used partial orders on $\Delta$ are the stochastic dominance $\preceq_{st}$ and the
monotone likelihood ratio $\preceq_{lr}$, defined below.

Definition 7.2. Let $\psi_1, \psi_2 \in \Delta$; we say that:

(i) $\psi_1 \preceq_{st} \psi_2$ if $\sum_{i \ge q} \psi_1(i) \le \sum_{i \ge q} \psi_2(i)$, for all $q \in X$, and

(ii) $\psi_1 \preceq_{lr} \psi_2$ if $\psi_1(j)\psi_2(i) \le \psi_1(i)\psi_2(j)$, for all $i, j \in X$ such that $i < j$.

Let $e^j$ denote the element of $\Delta$ with the $j$th component equal to 1, $j \in X$; thus, e.g.
$e^0 = (1, 0, 0, \ldots)$. The following is easily shown.

Lemma 7.2. If $\psi_1, \psi_2 \in \Delta$ and $\psi_1 \preceq_{lr} \psi_2$, then $\psi_1 \preceq_{st} \psi_2$. Also, for all $\psi \in \Delta$, $e^0 \preceq_{lr} \psi$.
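
The two orders of Definition 7.2 are easy to test on finite vectors. The sketch below assumes
a finite state space $X = \{0, \ldots, n-1\}$ (the text allows $X = \mathbb{N}_0$), and the
sample vectors are hypothetical. It also exhibits a pair ordered by stochastic dominance but
not by monotone likelihood ratio, so the converse of Lemma 7.2 fails.

```python
from fractions import Fraction as F


def st_leq(psi1, psi2):
    """Stochastic dominance: every tail sum of psi1 is at most that of psi2."""
    return all(sum(psi1[q:]) <= sum(psi2[q:]) for q in range(len(psi1)))


def lr_leq(psi1, psi2):
    """Monotone likelihood ratio: psi1(j) psi2(i) <= psi1(i) psi2(j) for all i < j."""
    n = len(psi1)
    return all(psi1[j] * psi2[i] <= psi1[i] * psi2[j]
               for i in range(n) for j in range(i + 1, n))


low = (F(1, 2), F(3, 10), F(1, 5))     # mass concentrated on small states
high = (F(1, 5), F(3, 10), F(1, 2))    # mass concentrated on large states
e0 = (F(1), F(0), F(0))

# A pair ordered by stochastic dominance but not by likelihood ratio.
a = (F(1, 2), F(0), F(1, 2))
b = (F(1, 4), F(1, 4), F(1, 2))
```

The quadratic number of comparisons in lr_leq is acceptable for illustration; for sorted
supports it suffices to compare adjacent indices where both vectors are positive, but that
refinement is not needed here.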

Definition 7.3. An action $a_j \in A$ is called a reset action if, for some $j \in X$,
$T(y, \psi, a_j) = e^j$, for all $y \in Y$ and $\psi \in \Delta$.

A reset action $a_j$ corresponds to the core state of the system being $j$, with probability
one, at the next time epoch after action $a_j$ has been taken. This type of action arises
naturally in manufacturing systems subject to inspection, maintenance, and replacement.
The following results derive from the work of Sondik [166]; see also [61].
Lemma 7.3. If there exists a reset action $a_j \in A$, then

$J_\beta(\psi) - J_\beta(e^j) \le \bar c(\psi, a_j)$, $\forall \psi \in \Delta$.

If $X$ is finite, and for each $j \in X$ there is a corresponding reset action, then for each
$\beta \in (0, 1)$ there exists $J \in X$ such that

$0 \le J_\beta(\psi) - J_\beta(e^{J}) \le M$, $\forall \psi \in \Delta$,

where $M := \max\{c(i, a) \mid i \in X, a \in A\}$.
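
The first bound of Lemma 7.3 can be traced back to the DCOE (7.15). The chain below is a
sketch only, under one added assumption that is not part of the lemma as reproduced here:
the one-stage cost is nonnegative, so that $J_\beta \ge 0$. It also uses
$\sum_{y} V(y, \psi, a_j) = 1$.

```latex
\begin{align*}
J_\beta(\psi)
  &\le \bar c(\psi, a_j)
       + \beta \sum_{y \in Y} V(y, \psi, a_j)\, J_\beta\bigl(T(y, \psi, a_j)\bigr)
  && \text{($a_j$ is one admissible choice in (7.15))} \\
  &= \bar c(\psi, a_j) + \beta\, J_\beta(e^j)
  && \text{(reset action: $T(y, \psi, a_j) = e^j$ for all $y$)} \\
  &\le \bar c(\psi, a_j) + J_\beta(e^j)
  && \text{(assumption: $J_\beta(e^j) \ge 0$ and $0 < \beta < 1$)}
\end{align*}
```

Rearranging the last line gives $J_\beta(\psi) - J_\beta(e^j) \le \bar c(\psi, a_j)$, with a
right-hand side that does not depend on $\beta$.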

Remark 7.5. Note that if $J_\beta(\cdot)$ is monotone with respect to $\preceq_{lr}$, and if there is an action
$a_0 \in A$ that resets the state to $e^0$, then $0 \le J_\beta(\psi) - J_\beta(e^0) \le \bar c(\psi, a_0)$ uniformly in
$\beta \in (0, 1)$. Furthermore, note that when $X$ is finite, a constant $M > 0$ exists such that
$\bar c(\psi, a_0) \le M$, for all $\psi \in \Delta$, and thus condition (A7.2) holds.

Models with a replacement action that resets the system to an "as new" state $e^0$ have
been considered in [2], [115], [127], [128], [145], [187]-[191], [184], [185]. Related problems
are those considered in [64], where a reset action to a most desirable state is available, and
in [88], where (maintenance) reset actions $a_j$ are available for all $j \neq 0$, with $X$ a finite set.

7.4. The convex analytic method. We will now briefly describe Borkar's convex
analytic approach. The action set $A$ is assumed to be any compact metric space. We also
assume that $c$ and $P$ are continuous in $a$. We will consider the pathwise average cost. This
cannot in general be written as an equivalent cost in terms of $\{\Psi_t\}$, but it is natural to
propose

(7.17)  $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \bar c(\Psi_t, A_t)$

as a substitute. Any $p \in \mathcal{P}(\Delta \times A)$ can be decomposed as

(7.18)  $p(d\psi, da) = \bar p(d\psi)\, \Phi(\psi)(da)$,

where $\bar p$ is the marginal of $p$ on $\Delta$ and $\Phi$ is the regular conditional law defined $\bar p$-a.s.
We shall always work with one arbitrary representative of this equivalence class. Define
$\Gamma \subset \mathcal{P}(\Delta \times A)$ by

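(7.19)  $\Gamma := \left\{ p \in \mathcal{P}(\Delta \times A) : \text{for } \bar p, \Phi \text{ as in (7.18), } \bar p \text{ is invariant under the stationary randomized policy } \Phi \right\}$

$\phantom{(7.19)\ \Gamma :} = \left\{ p \in \mathcal{P}(\Delta \times A) : \int_\Delta \int_A \int_\Delta f(\psi)\, \mathcal{K}(d\psi \mid \psi', a)\, \Phi(\psi')(da)\, \bar p(d\psi') = \int_\Delta f\, d\bar p, \text{ for all } f \in C_b(\Delta) \right\}$.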

From (7.19) one can easily check that $\Gamma$ is closed. Note that the set of invariant
probability measures for the process $\{\Psi_t\}$ controlled by a stationary randomized policy $\Phi$, when
nonempty, need not be a singleton. In general, it will form a closed convex set in $\mathcal{P}(\Delta)$, the
extreme points of which correspond to ergodic measures. That is, the above process with
one of these extreme measures (say, $p$) as the initial condition will be ergodic. Then (7.17)
will a.s. equal $\int \bar c\, dp$. In view of the ergodic decomposition of a stationary Markov process,
this will also be the case for other invariant measures (which will be a convex combination
of the ergodic ones). Define

$\rho^* := \inf_{p \in \Gamma} \int \bar c\, dp$.


We assume that $\rho^* < \infty$. We consider two alternative conditions under which the above
infimum will be a minimum.

(A7.4) Near-monotone case: $c$ satisfies

$\lim_{i \to \infty} \inf_a c(i, a) = \infty$.

(A7.5) Stable case: Condition (A5.19')(ii) holds.

Observe  that  the  ‘near-monotonicity’  condition  here  is  more  restrictive  than  the  one 
used  in  Section  5.  We  now  state  the  following  result;  the  proof  is  analogous  to  that  of 
Theorem  5.10. 

Lemma 7.4. Under either (A7.4) or (A7.5), the map $p \mapsto \int \bar c\, dp$ attains its minimum on
$\Gamma$.

Define the $\mathcal{P}(\Delta \times A)$-valued process $\{\nu_t\}$ by

$\int f\, d\nu_t = \frac{1}{t} \sum_{m=0}^{t-1} f(\Psi_m, A_m)$, $t \ge 1$, $f \in C_b(\Delta \times A)$,

where $\{\Psi_t\}$ is governed by some policy. Again we can prove the following analogue of
Lemma 5.1.

Lemma 7.5. With probability 1, any limit point of $\{\nu_t\}$ in $\mathcal{P}(\Delta \times A)$ lies in $\Gamma$.
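
The empirical measures above are simply running averages along a sample path. The following
minimal sketch evaluates $\int f\, d\nu_t$ for a recorded path of (belief, action) pairs; the
path, the cost table and the test functions are hypothetical. With $f = \bar c$ it gives the
finite-horizon average whose limit superior appears in (7.17).

```python
from fractions import Fraction as F


def empirical_integral(f, path):
    """Integral of f against nu_t: (1/t) * sum over m < t of f(psi_m, a_m)."""
    t = len(path)
    if t == 0:
        raise ValueError("nu_t is defined only for t >= 1")
    return sum(f(psi, a) for psi, a in path) / t


# Hypothetical one-stage cost c(x, a) on two states, and its belief version (7.14).
C = {"run": (F(0), F(2)), "replace": (F(1), F(1))}


def c_bar(psi, a):
    return sum(cx * px for cx, px in zip(C[a], psi))


def used_replace(psi, a):
    """Bounded test function: indicator that the action was 'replace'."""
    return F(1) if a == "replace" else F(0)


# A recorded path (Psi_0, A_0), ..., (Psi_3, A_3); values are illustrative only.
path = [
    ((F(1, 2), F(1, 2)), "run"),
    ((F(24, 35), F(11, 35)), "run"),
    ((F(1), F(0)), "replace"),
    ((F(1), F(0)), "run"),
]

average_cost = empirical_integral(c_bar, path)
replace_frequency = empirical_integral(used_replace, path)
```

Note that the indicator used here is bounded but not continuous in general; it is included
only to show that $\nu_t$ records relative frequencies along the path.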

Consider the near-monotone case. Suppose that for a given sample path, a subsequence
of $\{\nu_t\}$ has no limit point in $\mathcal{P}(\Delta \times A)$. Arguments similar to those in the proof of
Theorem 5.1 can be used to show that the cost must go to $+\infty$ along this subsequence. In
view of Lemma 7.5, this leads to

(7.20)  $\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \bar c(\Psi_t, A_t) \ge \rho^*$, a.s.

Along with Lemma 7.4, this would seem to lead to the existence of an optimal stationary
randomized policy. There is, however, one catch. It is not a priori clear that any initial law
for $\{\Psi_t\}$ would be in the domain of attraction of the element(s) of $\Gamma$ that minimize the cost
(or, for that matter, whether this domain of attraction can be reached in a finite random
time from any initial law under some policy). Similar 'reachability' problems surface when
one tries to extend the dynamic programming equations. These are circumvented under
somewhat stringent conditions in [132] as we have already discussed.

Finally, one can prove the convexity of $\Gamma$. Again it is unclear how (and whether) one can
characterize the extreme points of $\Gamma$ as those corresponding to stationary (non-randomized)
policies. As the ACOE is not available in this approach, the existence of an optimal
stationary policy remains an open issue in general. In the stable case, it is not clear if (7.20)
holds and thus this case remains open to investigation. To sum up, the convex analytic
approach to POCMP needs to be further studied.

§8. Multiobjective and Constrained Models

An important success of the convex analytic approach discussed in Section 5 is in the
domain of multiobjective problems, in which there is more than one cost (objective)
function. We will first consider a multiobjective CMP with average cost criterion recast as a
CMP with several constraints. CMP with one or multiple constraints have been studied in
[1], [16], [26], [39], [40], [43], [90], [91], [94], [116], [125], [141], [142], [162]. Our presentation
follows [25], [26].

We consider the case when $S = \{0, 1, 2, \ldots\}$, $A$, the action space, is a prescribed compact
metric space and $P(j \mid i, a)$ is continuous in $a$ for fixed $i, j$. Also, $U(i) = A$ for all $i \in S$.
In the constrained CMP problem we have, in addition to the cost function $c \in C_b(S \times A)$,
$m$ additional 'costs' $c_i \in C_b(S \times A)$, $1 \le i \le m$, and are required to satisfy

(8.1)  $a_i \le \int c_i\, d\hat\eta(f) \le b_i$, $1 \le i \le m$,

for prescribed numbers $b_i \ge a_i$, where $f \in \Pi_{SSR}$ and $\hat\eta(f) \in \mathcal{P}(S \times A)$ is as in
Section 5. (We are assuming all costs are bounded for the sake of simplicity. Also, we are
confining our attention to $\Pi_{SSR}$; this suffices under reasonable hypotheses, as we saw in
Section 5.) We will assume condition (A5.20) in Section 5.

Recall that $\Gamma_R = \{\hat\eta(f) : f \in \Pi_{SSR}\}$. Let $\tilde\Gamma_R$ be the subset of $\Gamma_R$ where the constraints
(8.1) are satisfied. Then $\tilde\Gamma_R$ is closed and convex. We assume also that it is compact (this
will be true under (A5.20) in Section 5). Under this assumption one can show as in Section 5
that there exists an $f^* \in \Pi_{SR}$ which is optimal for this problem. We will now proceed to
show that $f^*$ requires randomization in at most $m$ states.

Let $g \in C_b(S \times A)$. For some $\alpha \in \mathbb{R}$, let $H = \Gamma_R \cap \{\nu : \int g \, d\nu \le \alpha\}$, assumed to be
nonempty. Clearly $H$ is closed and convex. Let $\hat\eta(f)$ be an extreme point of $H$. Suppose
it is not an extreme point of $\Gamma_R$ itself. Then there exist distinct measures $\hat\eta(f_{11})$, $\hat\eta(f_{12})$
such that at least one of them (say $\hat\eta(f_{11})$) is not in $H$ and $\hat\eta(f)$ is a convex combination
of the two. Suppose $\hat\eta(f_{21}) \in \Gamma_R \setminus H$, $\hat\eta(f_{22})$ is another such pair. Then it can be shown
that $\hat\eta(f_{ij})$, $1 \le i, j \le 2$, are collinear ($\Gamma_R$, its constrained subset, $H$, etc. are viewed as
subsets of $\mathfrak{M}(S \times A)$, the Banach space of finite signed measures on $S \times A$). Therefore, all
pairs of points in $\Gamma_R$ satisfying: (a) at least one of them is not in $H$, and (b) $\hat\eta(f)$ is a convex
combination thereof, lie on a single straight line in $\mathfrak{M}(S \times A)$. Let $Z$ denote the intersection
of this line with $\Gamma_R$. Under our hypotheses on $\Gamma_R$, $Z$ is a closed finite line segment. Let
$\hat\eta(f_1)$, $\hat\eta(f_2)$ denote its end points. Then it can be shown that $\hat\eta(f_i)$, $i = 1, 2$, are extreme
points of $\Gamma_R$. By Lemma 5.2, $f_i \in \Pi_{SSD}$; also, $f_1$ and $f_2$ are distinct since $\hat\eta(f)$ is not an
extreme point of $\Gamma_R$. Therefore, there exists an $\alpha' \in (0, 1)$ such that

$\hat\eta(f) = \alpha' \hat\eta(f_1) + (1 - \alpha') \hat\eta(f_2)$.

Arguing as in the proof of Lemma 5.2, it is clear that for each $i \in S$ we may take $f(i)$ to be
a convex combination of $f_1(i)$ and $f_2(i)$. Let $f' \in \Pi_{SD}$ be such that for each $i \in S$, $f'(i)$
equals either $f_1(i)$ or $f_2(i)$. Then under our hypotheses ((A5.20) of Section 5) one can show
that $\hat\eta(f') \in Z$. Now consider $Z$ as a union of two closed line segments $Z_1$ and $Z_2$, $Z_1$ being
the line segment between $\hat\eta(f_1)$ and $\hat\eta(f)$, and $Z_2$ that between $\hat\eta(f_2)$ and $\hat\eta(f)$. Let $\{f'_n\}$
be a sequence in $\Pi_{SD}$, defined as follows: $f'_0 = f_1$, and, for $n \ge 1$,

$f'_n(i) = f_2(i)$ for $i \le n$,  $f'_n(i) = f_1(i)$ for $i > n$.

Then by the above considerations $\hat\eta(f'_n) \in Z$. Since $f'_n \to f_2$ as $n \to \infty$, we conclude that
$\hat\eta(f'_n) \to \hat\eta(f_2)$ (the map $f \mapsto \hat\eta(f)$ is continuous under (A5.19)). Thus, the sequence
$\hat\eta(f'_n)$, $n \ge 0$, starts in $Z_1$ and eventually moves into $Z_2$. Let $n$ denote the first time this
happens. Then either $\hat\eta(f'_n) = \hat\eta(f)$ or $\hat\eta(f)$ is a convex combination of $\hat\eta(f'_n)$ and
$\hat\eta(f'_{n-1})$. Since $f'_n(i) = f'_{n-1}(i)$ for $i \ne n$, the arguments employed in Lemma 5.2 show
that we may take $f(i)$ = the Dirac measure at $f'_n(i)$ for $i \ne n$ and $f(n)$ = a suitable convex
combination of Dirac measures at $f_1(n)$ and $f_2(n)$. We have established the following result.

Theorem 8.1. Each extreme point of $H$ corresponds to an $\hat\eta(f)$ such that $f \in \Pi_{SR}$
satisfies: for all but at most one $i$, $f(i)$ is a Dirac measure at some point of $A$. For the
single remaining $i$, if any, $f(i)$ is a convex combination of two such Dirac measures.
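
To see the pattern of Theorem 8.1 in the simplest setting, the following Python sketch treats a hypothetical two-state, two-action model with a single constraint $\int g \, d\hat\eta(f) \le \alpha$. All numbers (`P`, `c`, `g`, `ALPHA`) and all helper names are assumptions made for illustration; they are not taken from [25], [26]. The sketch enumerates the deterministic policies, adds the points where a segment joining two deterministic policies that differ in one state crosses the constraint boundary, and picks the cheapest candidate.

```python
from itertools import product

# Hypothetical 2-state, 2-action model; all numbers are made up for illustration.
P = {(0, 0): (0.9, 0.1), (0, 1): (0.2, 0.8),   # P[(i, a)] = (P(0|i,a), P(1|i,a))
     (1, 0): (0.5, 0.5), (1, 1): (0.1, 0.9)}
c = {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 3.0}  # cost to minimize
g = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}  # constrained cost
ALPHA = 0.4                                               # require integral of g <= ALPHA


def occupation(policy):
    """Ergodic occupation measure; policy[i] = probability of action 1 in state i."""
    q01 = (1 - policy[0]) * P[(0, 0)][1] + policy[0] * P[(0, 1)][1]
    q10 = (1 - policy[1]) * P[(1, 0)][0] + policy[1] * P[(1, 1)][0]
    pi = (q10 / (q01 + q10), q01 / (q01 + q10))  # invariant law of a 2-state chain
    return {(i, a): pi[i] * (policy[i] if a == 1 else 1 - policy[i])
            for i in (0, 1) for a in (0, 1)}


def integral(h, eta):
    """Integral of the function h against the measure eta."""
    return sum(h[k] * eta[k] for k in eta)


def policy_of(eta):
    """Recover the stationary policy by disintegrating the measure."""
    return tuple(eta[(i, 1)] / (eta[(i, 0)] + eta[(i, 1)]) for i in (0, 1))


deterministic = list(product((0.0, 1.0), repeat=2))
candidates = [occupation(f) for f in deterministic
              if integral(g, occupation(f)) <= ALPHA]
best_deterministic = min(integral(c, eta) for eta in candidates)

# Points where a segment between two deterministic policies differing in
# exactly one state meets the boundary {integral of g = ALPHA}.
for f1, f2 in product(deterministic, repeat=2):
    if sum(x != y for x, y in zip(f1, f2)) != 1:
        continue
    e1, e2 = occupation(f1), occupation(f2)
    g1, g2 = integral(g, e1), integral(g, e2)
    if g1 > ALPHA > g2:
        theta = (ALPHA - g2) / (g1 - g2)
        candidates.append({k: theta * e1[k] + (1 - theta) * e2[k] for k in e1})

best = min(candidates, key=lambda eta: integral(c, eta))
best_policy = policy_of(best)
randomized_states = [i for i in (0, 1) if 0.0 < best_policy[i] < 1.0]
print(best_policy, integral(c, best), best_deterministic, randomized_states)
```

In this made-up instance the cheapest candidate randomizes in one state only and is strictly cheaper than the best feasible deterministic policy, which is the behaviour described by Theorem 8.1 and Remark 8.1.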

A  variant  of  the  above  theorem  leads  to  the  following  result  [27]. 

Theorem 8.2. The minimum of $\nu \mapsto \int c \, d\nu$ on the constrained set $\tilde\Gamma_R$ is attained at an
$\hat\eta(f) \in \tilde\Gamma_R$ where $f$ is either deterministic or satisfies: there are states $i_1, \ldots, i_k \in S$ and
positive integers $n_1, \ldots, n_k \ge 1$ such that $f$ requires randomization among $n_j$ values at state
$i_j$, $1 \le j \le k$, requires no randomization for the remaining states, and $\sum_{j=1}^{k} n_j \le m$.

Once  this  existence  result  is  available,  necessary  conditions  for  optimality  can  be  obtained 
from  the  standard  Lagrange  multiplier  theory. 

Theorem 8.3. There exist $\lambda_i, \beta_i \ge 0$, $1 \le i \le m$, such that $\hat\eta(f)$ as in Theorem 8.2
minimizes

$F(\eta, \{\lambda_i\}, \{\beta_i\}) := \int c \, d\eta - \sum_{i=1}^{m} \lambda_i \left( b_i - \int c_i \, d\eta \right) - \sum_{i=1}^{m} \beta_i \left( \int c_i \, d\eta - a_i \right)$

on $\Gamma_R$. Furthermore, if the constrained set $\tilde\Gamma_R$ has nonempty interior, the following saddle
point property holds: for all $\lambda'_i, \beta'_i \ge 0$, $1 \le i \le m$, and $\eta \in \Gamma_R$,

$F(\hat\eta(f), \{\lambda'_i\}, \{\beta'_i\}) \le F(\hat\eta(f), \{\lambda_i\}, \{\beta_i\}) \le F(\eta, \{\lambda_i\}, \{\beta_i\})$.

Remark 8.1. The result in Theorem 8.1 cannot be improved in general. Indeed, in [26] there
is a counterexample to show the non-existence of an optimal $f \in \Pi_{SD}$ for the CMP with
one constraint.

Remark 8.2. We have discussed the stable case only. Analogous results can be obtained for
the near-monotone case (conditions similar to (A5.18)). For details, we refer to [25].

Remark  8.3.  When  the  action  set  A  is  countable,  analogous  results  are  obtained  in  [1]. 
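
The saddle point property of Theorem 8.3 can be checked numerically in a finite toy case. The sketch below assumes a single upper constraint (so $m = 1$, with the lower bound inactive and its multiplier equal to zero) and represents each extreme point of the set of occupation measures only by the pair ($\int c \, d\eta$, $\int c_1 \, d\eta$). The four pairs in `VERTICES`, the bound `B` and the function names are hypothetical. It also assumes the constraint is active at the optimum. Since the functional $F$ is linear in $\eta$, the right-hand inequality only needs to be tested at the extreme points.

```python
# Hypothetical (integral of c, integral of c_1) pairs at four extreme points.
VERTICES = [(1 / 6, 1.0), (1.5, 0.5), (26 / 9, 0.0), (18 / 13, 8 / 13)]
B = 0.4  # assumed upper bound b_1 on the constrained cost


def lagrangian(point, lam):
    """F = int c - lam * (b_1 - int c_1), lower-bound multiplier taken as 0."""
    cost, constrained = point
    return cost - lam * (B - constrained)


# Cheapest point with constrained cost exactly B on a segment joining an
# infeasible vertex p to a strictly feasible vertex q.
best = None
for p in VERTICES:
    for q in VERTICES:
        if p[1] > B > q[1]:
            theta = (B - q[1]) / (p[1] - q[1])
            cost = theta * p[0] + (1 - theta) * q[0]
            if best is None or cost < best[0]:
                best = (cost, p, q)

cost_star, p, q = best
optimum = (cost_star, B)
# Multiplier: extra cost paid per unit of constraint tightened along the
# active segment.
lam_star = (q[0] - p[0]) / (p[1] - q[1])

left = max(lagrangian(optimum, lam) for lam in (0.0, 1.0, 5.0, 50.0))
middle = lagrangian(optimum, lam_star)
right = min(lagrangian(v, lam_star) for v in VERTICES)
print(left, middle, right)
```

Up to rounding, `left <= middle <= right` holds for these numbers; because the constraint is tight at the optimum, the left inequality is in fact an equality here.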

We next consider another multiobjective CMP with AC criterion. We have $m$ cost
functions $c_i \in C_b(S \times A)$, $1 \le i \le m$. All cost functions are of equal importance and, as a
result, the optimality problem cannot be recast as a constrained one. Therefore, we deal
directly with the optimality problem with a vector cost criterion. This has been studied in [47],
[73].

Let $\Gamma_R$ be compact. Consider the vector cost criterion

$\left( \int c_1 \, d\hat\eta(f), \ldots, \int c_m \, d\hat\eta(f) \right)$,  $\hat\eta(f) \in \Gamma_R$.

In general, there need not exist an $f \in \Pi_{SSR}$ that minimizes all of $\int c_i \, d\hat\eta(f)$ over $\Gamma_R$. This
motivates the concept of Pareto optimality. An $f \in \Pi_{SSR}$ is said to be Pareto optimal
if there does not exist any $\tilde f \in \Pi_{SSR}$ for which $\int c_i \, d\hat\eta(\tilde f) \le \int c_i \, d\hat\eta(f)$, $1 \le i \le m$, with
the inequality being strict for at least one $i$. Pareto optimality is clearly the minimal requirement
for any reasonable notion of an optimal solution for the multiobjective problem with no
priority among objectives. The Pareto optimal solutions can be characterized as follows.

Theorem 8.4. Any $f \in \Pi_{SSR}$ which minimizes $\sum_{i=1}^{m} \lambda_i \int c_i \, d\hat\eta(f)$ for some $\lambda_i > 0$,
$1 \le i \le m$, is Pareto optimal. Conversely, any Pareto optimal $f \in \Pi_{SSR}$ minimizes the above
functional for some choice of $\lambda_i \ge 0$, $1 \le i \le m$.

Remark 8.4. Note that the converse is only partial since we have $\lambda_i \ge 0$ in place of $\lambda_i > 0$.
It becomes exact if $S$ and $A$ are finite.
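
The first half of Theorem 8.4 is easy to check by brute force. The sketch below is only an illustration under simplifying assumptions: it works with a finite list of hypothetical average-cost vectors for four policies (names `f1` to `f4` and all numbers are made up) rather than with the whole convex set of occupation measures, and it scans a handful of strictly positive weight vectors.

```python
# Hypothetical average-cost vectors (c_1, c_2) of four stationary policies.
costs = {"f1": (1.0, 4.0), "f2": (2.0, 2.0), "f3": (4.0, 1.0), "f4": (3.0, 3.0)}


def dominates(x, y):
    """x is no worse than y in every coordinate and strictly better in one."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def pareto_set(costs):
    """Names of the policies whose cost vector is not dominated by any other."""
    return {k for k, x in costs.items()
            if not any(dominates(y, x) for y in costs.values())}


def weighted_minimizers(costs, weights):
    """Names of the policies minimizing the weighted sum of the costs."""
    score = {k: sum(w * a for w, a in zip(weights, x)) for k, x in costs.items()}
    best = min(score.values())
    return {k for k, s in score.items() if abs(s - best) < 1e-12}


pareto = pareto_set(costs)
found = set()
for w1 in (0.1, 0.3, 0.5, 0.7, 0.9):          # strictly positive weights
    found |= weighted_minimizers(costs, (w1, 1.0 - w1))
print(sorted(pareto), sorted(found))
```

Every minimizer found with positive weights lies in the Pareto set, and the dominated policy `f4` is never selected.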

One often reduces a vector cost criterion as above to a scalar one by introducing a
‘utility function’. One such case is that of finding the ‘shadow minimum’ for the problem
of minimizing the vector cost $\nu \mapsto [\int c_1 \, d\nu, \ldots, \int c_m \, d\nu] \in \mathbb{R}^m$ on $\Gamma_R$. Letting $L$ denote the
range of this map, $L$ can be shown to be closed and convex. Suppose $y_i^* = \min\{\int c_i \, d\nu :
\nu \in \Gamma_R\}$, $1 \le i \le m$. Let $y^* = (y_1^*, \ldots, y_m^*)$. The point $y^*$ is called the ideal (or utopian)
point. The point $x^* \in L$ which is closest to $y^*$ is called the shadow minimum. This point
is unique and is characterized by

$\langle y^* - x^*, z - x^* \rangle \le 0$,  $z \in L$.

For finite $S$ and $A$, a combined linear-quadratic program can find $x^*$ explicitly [73]. The
point $x^*$ is easily seen to be Pareto optimal.
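
For two objectives and finite $S$ and $A$, the shadow minimum can be found by elementary geometry. The sketch below is not the linear-quadratic program of [73]; it swaps in a brute-force Euclidean projection, assuming that $L$ is the convex hull of finitely many hypothetical cost vectors (`vertices`, all numbers made up). It relies on the fact that the point of a convex polygon closest to a point outside it lies on a boundary segment joining two vertices.

```python
import math

# Hypothetical cost vectors (int c_1 dnu, int c_2 dnu) at the extreme points.
vertices = [(1.0, 5.0), (3.0, 1.0), (6.0, 0.5), (5.0, 5.0)]

# Ideal (utopian) point: coordinatewise minima, attained at extreme points.
ideal = tuple(min(v[i] for v in vertices) for i in range(2))


def project_on_segment(y, p, q):
    """Euclidean projection of the point y onto the segment [p, q]."""
    d = (q[0] - p[0], q[1] - p[1])
    dd = d[0] ** 2 + d[1] ** 2
    t = 0.0 if dd == 0 else ((y[0] - p[0]) * d[0] + (y[1] - p[1]) * d[1]) / dd
    t = max(0.0, min(1.0, t))
    return (p[0] + t * d[0], p[1] + t * d[1])


# Every segment between two vertices lies in L, and the true projection lies
# on one of them, so the closest candidate is the shadow minimum.
candidates = [project_on_segment(ideal, p, q) for p in vertices for q in vertices]
shadow = min(candidates, key=lambda x: math.dist(x, ideal))


def variational_gap(z):
    """Inner product <y* - x*, z - x*>, which should be <= 0 on L."""
    return ((ideal[0] - shadow[0]) * (z[0] - shadow[0])
            + (ideal[1] - shadow[1]) * (z[1] - shadow[1]))


print(ideal, shadow, [variational_gap(z) for z in vertices])
```

Since the inner product is linear in $z$, checking the characterizing inequality at the vertices is enough to confirm it on all of $L$ in this toy case.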

§9. Conclusions


We  hope  this  survey  has  provided  a  useful  presentation  of  the  problems  and  techniques 
in  average  cost  control  of  Markov  processes.  As  is  amply  clear,  there  is  not  a  globally 
applicable  approach.  Instead,  one  expects  to  build  a  library  of  special  tricks,  a  collection 
of  simple  verifiable  sufficient  conditions  under  which  the  problem  is  accessible,  possibly 
with  different  techniques.  Going  one  step  further,  there  are  the  more  difficult  partially 
observable  and  multiobjective  problems.  Though  these  have  seen  some  significant  results  of 
late,  there  remains  much  more  that  eludes  satisfactory  analysis.  A  similar  comment  applies 
to  computational  aspects  and  adaptive  control,  two  topics  we  have  not  touched  upon  here. 
For  computational  aspects  we  refer  to  [79],  [85],  [133],  [176],  and  for  adaptive  control,  [26], 
[80],  [99].  Also,  we  have  not  dealt  with  the  vast  literature  on  sensitive  optimality  [133],  [178], 
nor with some other criteria, such as overtaking [108], variance sensitive [194], and
weighted cost [58], [63], [96]. Finally, the discrete-time models have interesting applications
to continuous-time problems, for which we refer to [14, Sect. 6.7], [106], [155], [202].

Appendix

Multifunctions and measurable selectors. Let $V$ and $W$ denote non-empty Borel
spaces, and let $2^W$ denote the collection of all nonempty subsets of $W$. A multifunction
(or set-valued function) $\Phi$ from $V$ to $W$ is a map $\Phi : V \to 2^W$. The subset $\mathrm{Dom}(\Phi) :=
\{v \in V : \Phi(v) \ne \emptyset\}$ is called the domain of $\Phi$. When $\mathrm{Dom}(\Phi) = V$ we say that the
map $\Phi$ is strict. In what follows we assume that $\Phi$ is a strict multifunction. If, for each
$v \in V$, $\Phi(v)$ is a compact (closed, measurable) subset of $W$, then $\Phi$ is said to be compact
(closed, measurable)-valued. A selector (or selection) of $\Phi$ is a function $\varphi : V \to W$ such
that $\varphi(v) \in \Phi(v)$, for all $v \in \mathrm{Dom}(\Phi)$. The set of (Borel) measurable selectors of $\Phi$ will be
denoted by $\mathcal{S}(\Phi)$. The graph of $\Phi$, denoted by $\mathrm{Graph}(\Phi)$, is defined as

$\mathrm{Graph}(\Phi) := \{(v, w) : v \in V, \ w \in \Phi(v)\}$.

For a set $W' \in 2^W$, we define

$\Phi^{-1}[W'] := \{v \in V : \Phi(v) \cap W' \ne \emptyset\}$,

and we say that $\Phi$ is (Borel) measurable if $\Phi^{-1}[B] \in \mathcal{B}(V)$, for each closed subset $B$ of
$W$. If $\Phi$ is closed-valued, then measurability of $\Phi$ implies that $\mathrm{Graph}(\Phi) \in \mathcal{B}(V \times W)$;
furthermore, if $\Phi$ is compact-valued, then the converse also holds [86], [180, Th. 4.2]. The
multifunction $\Phi$ is called upper semicontinuous if for every $v \in V$ and every open set
$G \supset \Phi(v)$ there exists a neighborhood $N$ of $v$ such that $\Phi(v') \subset G$, for all $v' \in N$; it is
called lower semicontinuous if for every $v \in V$ and every open set $G$ such that $G \cap \Phi(v) \ne \emptyset$,
$\Phi^{-1}[G]$ contains an open neighborhood of $v$. Also, $\Phi$ is said to be continuous if it is both
upper and lower semicontinuous.

The  following  result,  in  different  variations,  has  been  shown  by  several  authors  [15, 
Sect.  7.5],  [46,  Lemma  6,  p.  38],  [50,  Chap.  2],  [86],  [150],  and  summarized  also  in  [80],  [180, 
Th.  9.1]. 

Theorem A.1. Let $\Phi$ be a compact-valued, measurable, strict multifunction from $V$ to
$W$. Let $f : \mathrm{Graph}(\Phi) \to \mathbb{R}$ be a measurable function, such that for each $v \in V$, $f(v, \cdot)$
is lower semicontinuous on $\Phi(v)$. Then there exists a measurable selector $\varphi^* \in \mathcal{S}(\Phi)$ such
that

$f(v, \varphi^*(v)) = \min_{w \in \Phi(v)} f(v, w)$,  $\forall\, v \in V$.

Let $f^* : V \to \mathbb{R}$ be defined by $f^*(v) := f(v, \varphi^*(v))$. If $\Phi$ is upper semicontinuous and $f$
is bounded below, then $f^* \in C(V)$. Also, if $\Phi$ is continuous and $f \in C_b(V \times W)$, then
$f^* \in C_b(V)$.
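
When $V$ and every $\Phi(v)$ are finite, measurability is automatic and the selector of Theorem A.1 reduces to a pointwise argmin. The following sketch only illustrates that finite special case; the multifunction `Phi`, the function `f` and the tie-breaking rule are hypothetical choices, and none of the measure-theoretic content of the theorem is captured.

```python
def minimizing_selector(Phi, f):
    """Return (phi_star, f_star) with phi_star[v] in Phi[v] attaining min_w f(v, w)."""
    phi_star, f_star = {}, {}
    for v, choices in Phi.items():
        # sorted() fixes a deterministic tie-breaking rule (smallest minimizer).
        w_best = min(sorted(choices), key=lambda w: f(v, w))
        phi_star[v] = w_best
        f_star[v] = f(v, w_best)
    return phi_star, f_star


# Hypothetical data: V = {0, 1, 2}; Phi(v) is a finite set of admissible points.
Phi = {0: {0, 1}, 1: {1, 2, 3}, 2: {2}}


def f(v, w):
    """Made-up objective, minimized over w in Phi(v) for each fixed v."""
    return (w - v) ** 2 + 0.5 * w


phi_star, f_star = minimizing_selector(Phi, f)
print(phi_star, f_star)
```

Here `phi_star` plays the role of $\varphi^*$ and `f_star` the role of $f^*$; in dynamic programming the same construction picks a minimizing action in each state.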

A Tauberian theorem. The following Tauberian theorem plays a very important role in
the analysis of the average cost criterion. For its proof, which is very hard to locate in the
literature in this particular format, we refer the reader to [172].

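Theorem A.2. Let $\{a_n\}$ be a sequence of nonnegative numbers and $\beta \in (0, 1)$. Then

$\liminf_{N \to \infty} \frac{1}{N} \sum_{m=0}^{N-1} a_m \le \liminf_{\beta \uparrow 1} (1 - \beta) \sum_{n=0}^{\infty} \beta^n a_n \le \limsup_{\beta \uparrow 1} (1 - \beta) \sum_{n=0}^{\infty} \beta^n a_n \le \limsup_{N \to \infty} \frac{1}{N} \sum_{m=0}^{N-1} a_m$.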
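
A quick numerical sanity check of Theorem A.2 is possible for a sequence whose limits are known in closed form. The sketch below uses the hypothetical alternating sequence 1, 0, 1, 0, ..., for which the Cesàro averages tend to 1/2 and the discounted sums equal $1/(1+\beta)$, so all four quantities in the theorem coincide. The function names and the truncation length `terms` are assumptions made for illustration.

```python
def cesaro_mean(a, N):
    """(1/N) * sum of a(0), ..., a(N-1)."""
    return sum(a(n) for n in range(N)) / N


def abel_mean(a, beta, terms=200_000):
    """(1 - beta) * sum_{n < terms} beta**n * a(n): a truncation of the series."""
    total, power = 0.0, 1.0
    for n in range(terms):
        total += power * a(n)
        power *= beta
    return (1.0 - beta) * total


def a(n):
    """Hypothetical test sequence 1, 0, 1, 0, ..."""
    return 1.0 if n % 2 == 0 else 0.0


for beta in (0.9, 0.99, 0.999):
    print(beta, abel_mean(a, beta), 1.0 / (1.0 + beta))
print(cesaro_mean(a, 10_000))
```

As `beta` increases to 1 the discounted values approach 1/2, the common value of the lower and upper Cesàro limits for this sequence.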